Media content mixing apparatuses, methods and systems

ABSTRACT

In aspects, systems, methods, apparatuses and computer-readable storage media implementing embodiments for mixing audio content based on a plurality of user generated recordings (UGRs) are disclosed. In embodiments, the mixing comprises: receiving a plurality of UGRs, each UGR of the plurality of UGRs comprising at least audio content; determining a correlation between samples of audio content associated with at least two UGRs of the plurality of UGRs; generating one or more clusters comprising samples of the audio content identified as having a relationship based on the determined correlations; synchronizing, for each of the one or more clusters, the samples of the audio content to produce synchronized audio content for each of the one or more clusters; normalizing, for each of the one or more clusters, the synchronized audio content to produce normalized audio content; and mixing, for each of the one or more clusters, the normalized audio content.

CROSS-REFERENCE TO RELATED APPLICATIONS

The present application claims the benefit of U.S. Provisional Patent Application No. 62/517,004 entitled, “AUDIO SAMPLE MIXING APPARATUSES, METHODS, AND SYSTEMS,” filed Jun. 8, 2017, which is expressly incorporated by reference herein in its entirety.

GOVERNMENT FUNDED RESEARCH

The project leading to this application has received funding from the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement No. 687605.

TECHNICAL FIELD

The present subject matter is directed generally to apparatuses, methods, and systems for acoustic signal processing, and more particularly, to AUDIO SAMPLE MIXING APPARATUSES, METHODS, AND SYSTEMS.

BACKGROUND

With the proliferation of portable multimedia devices, drones, and smartphones, devices capable of capturing every moment of their respective users' lives and the events they attend, such as concerts, sporting events, family celebrations, and the like, have become widespread. Audiovisual recordings from these devices, produced by users attending the same event, may become available through social media platforms and other media outlets where the users may submit or publish their video and audio content. The availability of such massive amounts of User Generated Recordings (UGRs) has triggered new research directions related to the search, organization, and management of this content, and has provided inspiration for new business models for content storage, retrieval, and consumption.

SUMMARY

The present disclosure is directed to systems, methods, and computer-readable storage media that facilitate new techniques for organizing, managing, and utilizing UGRs. More specifically, embodiments of the present disclosure provide various techniques for utilizing available UGR content to create new media content. For example, given a collection of UGRs, which may include audio content, video content, audio and video content, and the like, several approaches have been proposed for exploiting the available visual and audio content (as well as several types of metadata) in order to identify video clips associated with the same moment of the captured event and to synchronize them along the same temporal axis. The audio content is key to solving this problem, and several works have shown that the relations between different UGRs can be revealed by exploiting the correlations in their associated audio streams.

An emerging research challenge is to investigate different means by which this low-quality but organized content can be synergistically processed and combined, so as to produce a new audiovisual sequence which provides an improved experience of the captured event. The potential is particularly interesting with respect to the audio modality, as a multitude of synchronized audio recordings essentially provides a multichannel acoustic representation of the event. By combining the different sources of content, it is possible to construct a new audio stream with improved properties in comparison to each one of its constituent parts. However, several preparation steps are required before reaching the point at which the different sources of content can be mixed.

In this application we present several technological advancements related to the processing of user-generated content (UGC), and particularly audio and video recordings. They are all displayed in FIG. 1. We start by presenting an approach to estimate the match-strength and the temporal offset between two video or audio recordings which is applicable regardless of what type of audio or visual features are used, as long as these features are extracted at regular time intervals. Based on all pairwise relations involving a collection of UGRs, we then propose a methodology for separating the recordings into clusters and then for synchronizing all members of each cluster along the same time-axis. This operation is very important as it allows for synchronous playback of the content, regardless of whether this concerns the audio or the visual component in the video streams. We then define an iterative normalization process to modify the signal levels of the audio recordings, a step that may significantly improve the outcome of audio mixing. Finally, we present four different audio mixing techniques which are applicable to any type of content (audio or video). We note that each contribution is independent of the others and does not depend on the approach followed in the previous step. For example, our hierarchical clustering approach does not depend on the type of similarity measure used in order to assess the relations between UGRs, and each mixing approach is applicable irrespective of what approach was used for synchronizing or normalizing the audio recordings, if any such approach was used.

The foregoing has outlined rather broadly the features and technical advantages of the present disclosure in order that the detailed description that follows may be better understood. Additional features and advantages will be described hereinafter which form the subject of the claims. It should be appreciated by those skilled in the art that the conception and specific embodiment disclosed may be readily utilized as a basis for modifying or designing other structures for carrying out the same purposes of the present disclosure. It should also be realized by those skilled in the art that such equivalent constructions do not depart from the scope of the present disclosure as set forth in the appended claims. The novel features which are believed to be characteristic of embodiments described herein, both as to its organization and method of operation, together with further objects and advantages will be better understood from the following written description when considered in connection with the accompanying figures. It is to be expressly understood, however, that each of the figures is provided for the purpose of illustration and description only and is not intended as a definition of the limits of the present disclosure.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings illustrate various non-limiting, example, inventive aspects of audio sampling apparatuses, methods, and systems.

FIG. 1 is a flow diagram illustrating exemplary aspects of a method for mixing a plurality of UGRs according to embodiments of the present disclosure;

FIG. 2 shows a diagram illustrating aspects of time aligning UGRs along a common time axis;

FIG. 3 is a flow diagram illustrating exemplary aspects of another method for mixing content of a plurality of UGRs according to embodiments of the present disclosure;

FIG. 4 is a diagram illustrating aspects of extracting information from connected UGRs;

FIG. 5 is a diagram illustrating additional aspects of extracting information from connected UGRs;

FIG. 6 shows a block diagram illustrating exemplary aspects of a mixing controller for mixing audio samples in accordance with embodiments of the present disclosure; and

FIG. 7 is a flow diagram of a method for mixing a plurality of UGRs in accordance with embodiments of the present disclosure.

It should be appreciated by those skilled in the art that any block diagrams herein represent conceptual views of illustrative systems. Similarly, it should be appreciated that any flow charts, flow diagrams, state transition diagrams, pseudo code, and the like represent various processes, which may be substantially represented in a computer readable medium and executed by a computer or processor, whether or not such computer or processor is explicitly shown.

DETAILED DESCRIPTION

An emerging research challenge relates to investigating ways in which low-quality but organized media content recordings, such as UGRs, may be synergistically processed and combined to produce a new media content sequence. The potential is particularly interesting with respect to the audio modality, as a multitude of synchronized UGRs may be utilized to generate a multichannel recording of an acoustic event. For example, by combining different sources of content (e.g., different UGRs), a new audio stream may be constructed with improved properties in comparison to each one of its constituent parts. Several preparation steps may be utilized to prepare these different sources of content for mixing. In the description that follows, several approaches for exploiting UGR content (e.g., visual content, audio content, and metadata) to identify portions of a set of UGRs that are associated with or relate to the same moment or portion of a captured event are disclosed. In accordance with the embodiments disclosed herein, similarities between content of different UGRs may be utilized to identify relationships between different UGRs and may also facilitate estimation of an overlap between contents of the UGRs. This information may be utilized to synchronize various UGRs along the same temporal axis, such that the synchronized UGRs may be combined to create new sequences of media content, as described in more detail below. In aspects, the term UGR may refer to recordings of audio content. However, it is noted that UGRs processed according to embodiments of the present disclosure may also contain visual content (e.g., image and/or video content). In aspects, each recording or UGR utilized to create mixed content may have an equal contribution to the produced mixed content.

Referring to FIG. 1, a flow diagram illustrating exemplary aspects of a process for generating media content by mixing a plurality of UGRs according to embodiments of the present disclosure is shown as a method 100. As shown in FIG. 1, the method 100 includes a first step 110 where a pairwise relation is determined for a collection of UGRs. In aspects, the collection of UGRs may include UGRs from unknown space and time. The method 100 also includes a second step 120, where the UGRs may be separated into groups or clusters. In aspects, each cluster of UGRs may be relevant to a particular event or a part of the particular event. After clustering has been completed, the method 100 includes a third step 130 where UGRs in the same group or cluster may be synchronized along a time-axis. In aspects, the time-axis may be a unique time-axis. In a fourth step 140, the synchronized UGRs may be normalized, and in a fifth step 150 of the method 100, the UGRs may be mixed based on the synchronization to produce new media content, which may include audio content, video content, or both audio content and video content.

Exemplary aspects of each of the steps 110-150 of the method 100 are described in more detail below. In aspects, a fingerprinting algorithm for calculating the pairwise relations between UGRs and an approach for clustering UGRs based on various types of pairwise similarity measures between the UGRs are disclosed. Additionally, a methodology for synchronizing UGRs belonging to the same cluster along a same time-axis and an iterative normalization process are also disclosed. Further, a plurality of different mixing techniques are disclosed that exploit a multiplicity of matched, synchronized and normalized UGRs in order to reproduce a new audio stream which combines acoustic information available in the overlapping UGRs at each point in time. In aspects, the hierarchical clustering approach (e.g., in the second step 120) may not depend on the type of similarity measure used in order to express the relations between UGRs (e.g., in the first step 110), and each mixing approach (e.g., in the fifth step 150) may be applicable irrespective of the normalization approach (e.g., in the fourth step 140) employed for normalizing the audio clips.

As briefly described above, a pairwise match strength and time-offset estimation may be performed at step 110. As a result of step 110, information that provides an indication of the similarity between different pairs of UGRs, together with timing information, may be generated. For example, consider a collection of M UGRs in a particular format (e.g., a pulse-code modulation (PCM) format) such that vectors x_(m), m=1, . . . , M contain sampled values of each UGR. In aspects, the UGRs may be sampled at a common sampling rate (F_(s)). A similarity measure R_(ij) between any pair of different UGRs i and j may be estimated as follows. First, for each audio file, an N_(m)×1 audio fingerprint vector (F_(m)) may be constructed and stored in a memory, where N_(m) is the number of time-frames used in the analysis of the m^(th) recording (and thus depends on the duration of that recording). Second, for each pair ij, a vector containing the values of a generalized cross-correlation function denoted as $R_{F_{i}F_{j}}(\tau)$ may be calculated, where $\tau \in \mathbb{Z}$ spans all possible time-frame offsets between fingerprints i and j. Finally, the maximum of the cross-correlation, which may be given by:

$\begin{matrix}{R_{ij} = {\max\limits_{\tau}{R_{F_{i}F_{j}}(\tau)}}} & (1)\end{matrix}$

may be utilized as the similarity measure between recordings i and j. As a side product of this process, the time-frame difference of arrival $\hat{\tau}_{ij} = \arg\max_{\tau} R_{F_{i}F_{j}}(\tau)$ may be derived, which may be used to synchronize recordings i and j. In aspects, the time-frame differences of arrival may be stored in a memory as an M×M matrix T such that $T_{ij} = \hat{\tau}_{ij}$ and $T_{ji} = -\hat{\tau}_{ij}$.
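By way of non-limiting illustration, the estimation of Eq. (1) may be sketched as follows. The sketch assumes fingerprints with values of 1 or −1 and a plain, unnormalized cross-correlation; the function name `match_strength` and the sign convention chosen for the offset τ are illustrative assumptions rather than features of the disclosure.

```python
def match_strength(f_i, f_j):
    """Return (R_ij, tau_hat) for two fingerprint sequences.

    Assumed convention: R(tau) = sum_n f_i[n + tau] * f_j[n], evaluated for
    every offset tau at which the two sequences overlap by at least one frame.
    """
    best_r, best_tau = None, None
    for tau in range(-(len(f_j) - 1), len(f_i)):
        r = 0
        for n, value in enumerate(f_j):
            k = n + tau
            if 0 <= k < len(f_i):
                r += f_i[k] * value
        # Keep the first offset that attains the largest correlation value.
        if best_r is None or r > best_r:
            best_r, best_tau = r, tau
    return best_r, best_tau


# f_i contains f_j delayed by two frames.
f_j = [1, -1, -1, 1, 1, -1, 1]
f_i = [1, 1] + f_j + [-1]
r_ij, tau_ij = match_strength(f_i, f_j)
r_ji, tau_ji = match_strength(f_j, f_i)
```

In this example the two calls return offsets of opposite sign, which corresponds to the antisymmetric structure of the offset matrix T.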

The above-identified process does not depend on the particular types of fingerprints used for each of the UGRs. In aspects, a method for constructing a fingerprint vector may take into account energy variations in a single frequency region. For the fingerprint extraction process, an auxiliary signal may be defined, which may be given by:

$\begin{matrix}{P_{m}[n] = {\sum\limits_{\tau = n - L}^{n + L}{\sum\limits_{\omega = k_{LB}}^{k_{UB}}{|X_{m}(\tau,\omega)|}}}} & (2)\end{matrix}$

where |X_(m)(τ,ω)| is the magnitude of the short-time Fourier transform (STFT) coefficient associated with the τ^(th) time-frame and ω^(th) frequency bin of the audio signal of recording m, k_(LB) and k_(UB) are the frequency indexes corresponding to the lower and upper frequency limits, respectively, and L is a positive integer used for time averaging. The sub-fingerprint at time n is a scalar defined as:

$\begin{matrix}{F_{m}[n] = \begin{cases} 1 & \text{if } P_{m}[n] \geq P_{m}[n - 1] \\ -1 & \text{if } P_{m}[n] < P_{m}[n - 1] \end{cases}} & (3)\end{matrix}$

In aspects, the fingerprint vector F_(m) may be a vector with values equal to 1 or −1.
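By way of non-limiting illustration, Eqs. (2) and (3) may be sketched as follows for a precomputed matrix of STFT magnitudes. The function name `fingerprint`, the clipping of the averaging window at the edges of the recording, and the omission of the first frame (which has no predecessor) are assumptions made for this sketch only.

```python
def fingerprint(mag, k_lb, k_ub, L):
    """Binary fingerprint from STFT magnitudes.

    mag[t][k] is |X(t, k)| for time-frame t and frequency bin k.
    Returns one value (+1 or -1) per frame, starting from the second frame.
    """
    # Inner sum of Eq. (2): magnitude summed over the band k_lb..k_ub (inclusive).
    band = [sum(frame[k_lb:k_ub + 1]) for frame in mag]
    n_frames = len(band)
    # Outer sum of Eq. (2): window of 2L+1 frames, clipped at the edges (assumption).
    P = [sum(band[max(0, n - L):min(n_frames, n + L + 1)]) for n in range(n_frames)]
    # Eq. (3): sign of the frame-to-frame change of the auxiliary signal.
    return [1 if P[n] >= P[n - 1] else -1 for n in range(1, n_frames)]


# Four frames, three bins; only the middle bin lies inside the analysis band.
mag = [[9, 1, 9], [0, 2, 0], [5, 2, 5], [9, 1, 0]]
F = fingerprint(mag, k_lb=1, k_ub=1, L=0)
```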

With respect to the hierarchical clustering of UGRs at step 120, consider a collection of M UGRs x_(m), m=1, . . . , M. Let M×M matrix R denote a matrix containing a measure of all pairwise similarities between UGRs (i.e., $R_{ij} = \text{similarity}(x_{i}, x_{j}) \in \mathbb{R}^{+}$), where the similarity measure may be obtained using, for example, audio signal cross-correlation, or fingerprint cross-correlation, such as the technique defined above with reference to Equation (2).

From matrix R, another M×M distance matrix D may be constructed such that:

$\begin{matrix}{D_{ij} = \frac{1}{R_{ij}}} & (4)\end{matrix}$

Matrix R may provide information about pairwise relations between UGRs, but a higher level of organization for the collection may be beneficial so that recordings originating from the same part of the event (e.g., a particular song in a concert) may be grouped into a single cluster. Below, an exemplary three step process to determine both a number of clusters as well as the members of each cluster is described:

(1) Assuming that each UGR represents an observation and that D_(ij), ∀i,j:i≠j represents the distance between observations i and j, obtain an agglomerative hierarchical cluster tree linking all M UGRs using a single-linkage hierarchical clustering method. In aspects, Matlab's function linkage may be used.

(2) The number of clusters and the identities of the UGRs in each cluster may be estimated by setting a threshold D_(max) and requesting that the distance that a UGR should exhibit before entering a particular cluster is smaller than D_(max).

(3) A graph consisting of M nodes may be constructed so that the m^(th) node corresponds to the m^(th) UGR and so that initially, no node is connected to another node. Based on the results of hierarchical clustering, only those nodes which are directly linked in the cluster tree describing the linkage between the UGRs from step (1) may be connected in the graph. In aspects, two variations concerning the weights given to the edge connecting two nodes in the graph may exist. In aspects, these weight variations may indicate that the weight between connected nodes is always equal to 1, or that the weight between connected nodes i and j is equal to their distance D_(ij).
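By way of non-limiting illustration, the three steps above may be sketched as follows. Single-linkage clustering merges groups in order of increasing smallest inter-group distance, which is the order in which Kruskal's algorithm adds edges to a minimum spanning tree; the sketch therefore uses a thresholded Kruskal procedure as a stand-in for a library routine such as Matlab's linkage and does not reproduce that routine's output format. The function name `cluster_ugrs`, the 0-based indexing, and the treatment of a zero similarity as an infinite distance are assumptions.

```python
def cluster_ugrs(R, d_max):
    """Return (clusters, edges) from a symmetric similarity matrix R."""
    M = len(R)
    # Eq. (4): distance is the reciprocal of similarity (inf if R_ij == 0, an assumption).
    pairs = sorted(
        (1.0 / R[i][j] if R[i][j] > 0 else float("inf"), i, j)
        for i in range(M) for j in range(i + 1, M)
    )
    parent = list(range(M))

    def find(a):
        # Union-find root lookup with path halving.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    edges = []
    for d, i, j in pairs:
        if d >= d_max:          # step (2): only distances below the threshold merge
            break
        ri, rj = find(i), find(j)
        if ri != rj:            # step (3): keep only edges that link two groups
            parent[ri] = rj
            edges.append((i, j, d))
    groups = {}
    for m in range(M):
        groups.setdefault(find(m), []).append(m)
    return sorted(groups.values()), edges


R = [
    [0.0, 10.0, 4.0, 0.5],
    [10.0, 0.0, 5.0, 0.5],
    [4.0, 5.0, 0.0, 0.5],
    [0.5, 0.5, 0.5, 0.0],
]
clusters, edges = cluster_ugrs(R, d_max=0.5)
```

Each cluster with M′ members receives exactly M′−1 edges, so the resulting graph is a tree per cluster; each returned edge carries its distance, so either of the two edge-weight variations described above may be applied afterwards.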

In the exemplary operations of step 130 for synchronization of UGRs described below, it may be assumed that a collection of M′≥2 UGRs assigned to the same cluster based on the approach described above (e.g., based on operations described above with respect to steps 110 and 120) is identified. Based on step (3) of the second step 120 described above, these M′ UGRs may form a graph with M′ nodes and M′−1 edges. A clip indexed with m′ may be assigned as a reference clip for this collection. In aspects, m′ may be the index of either of the two nodes exhibiting the smallest distance D_(ij) (i.e., the two UGRs exhibiting the strongest similarity value). For this collection of UGRs, let T be an M′×M′ matrix of initial synchronization time-frame offsets, such that T_(ij) is the time-frame offset maximizing the cross-correlation between clips i and j, or the time-frame offset which maximizes the cross-correlation between the fingerprints associated with clips i and j (this information may be derived based on the analysis of step 110). In aspects, T_(ij)=−T_(ji) holds. In aspects, some of the time-frame offsets in matrix T might be incorrect. This may be caused by, for example, weak cross-correlation between two audio streams, or by the fact that two UGRs, although in the same cluster, do not temporally overlap at all. Based on the graph constructed in the second step 120, let set p(i→m′) denote the shortest path connecting node i with reference node m′. In aspects, this may be calculated by using graph analysis functionality, such as the Matlab function graphshortestpath. The new sample offset Q_(im′) synchronizing UGR i with UGR m′ may be determined by summing together all sample offsets specific to the edges that form the shortest path from i to m′. For example, if m′=5 and the shortest path connecting UGR 1 with UGR 5 is the set {1, 2, 4, 5}, then Q_(1,5)=T_(1,2)+T_(2,4)+T_(4,5). Based on this approach, all M′−1 remaining UGRs in the same cluster may be synchronized with respect to the reference UGR and, as a consequence, to one another.
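By way of non-limiting illustration, the path-based offset accumulation may be sketched as follows. Because a cluster graph with M′ nodes and M′−1 edges is a tree, the path between any node and the reference is unique, so a breadth-first traversal from the reference suffices in place of a general shortest-path routine. The function name `offsets_to_reference`, the dictionary representation of T, and the numerical offsets are hypothetical.

```python
from collections import deque


def offsets_to_reference(edges, T, ref):
    """Sum pairwise offsets along the tree path from every node to ref.

    edges: iterable of (i, j) node pairs connected in the cluster graph.
    T:     dict mapping (i, j) to the pairwise offset, with T[(j, i)] == -T[(i, j)].
    """
    adjacency = {}
    for i, j in edges:
        adjacency.setdefault(i, []).append(j)
        adjacency.setdefault(j, []).append(i)
    Q = {ref: 0}
    queue = deque([ref])
    while queue:
        u = queue.popleft()
        for v in adjacency.get(u, []):
            if v not in Q:
                # The path from v to ref passes through u, so Q[v] = T[v, u] + Q[u].
                Q[v] = T[(v, u)] + Q[u]
                queue.append(v)
    return Q


# Hypothetical offsets for the tree 1-2-4-5 with an extra branch 3-4; reference m' = 5.
pairwise = {(1, 2): 3, (2, 4): -7, (4, 5): 10, (3, 4): 2}
T = dict(pairwise)
T.update({(j, i): -t for (i, j), t in pairwise.items()})
Q = offsets_to_reference(list(pairwise), T, ref=5)
# Q[1] equals T[1,2] + T[2,4] + T[4,5] = 3 - 7 + 10 = 6.
```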

In an aspect, the normalization process of step 140 may be utilized to improve the quality of media content produced by the mixing step. To illustrate, consider a collection of M temporally overlapping UGRs sampled at a common sampling rate F_(s). As users start and stop recording at random time instants, each UGR in the collection may have a different duration and position on a timeline. After performing synchronization, as described above, it may be assumed that all M recordings are correctly time aligned along a common time axis. Referring briefly to FIG. 2, a diagram illustrating aspects of time aligning UGRs along a common time axis according to embodiments of the present disclosure is shown. In FIG. 2, three mutually overlapping audio recordings defining six transition points and five constant time segments are shown, where time points are indexed with n and time segments with l. In aspects, the time axis may be discretized in terms of a uniform time grid, and each point on the time grid may be indexed by n, as shown in FIG. 2. It is noted that these time points need not necessarily correspond to signal samples, as in practice a sparser discretization than that used for sampling the audio signal may be used in some cases, which may reduce computational complexity. Returning now to operations of step 130, let n_(m)^(start) and n_(m)^(end) denote the start and end time, respectively, of the m^(th) UGR. Time points n_(m)^(start) and n_(m)^(end)+1, m=1, . . . , M may represent the so-called transition points of the plot (e.g., the locations in time where the number of audio clips participating in the mix changes). Using the notation $\mathcal{U}_{m} = \{n_{m}^{start}, n_{m}^{start} + 1, \ldots, n_{m}^{end}\}$ to denote the set of time indexes for which audio clip m contains audio information, it may be assumed in what follows that, for each clip i in the collection, there is at least one other clip j such that $\mathcal{U}_{i} \cap \mathcal{U}_{j} \neq \emptyset$. In other words, each UGR overlaps (completely or partly) with at least one other UGR in the collection. The set $\mathcal{U}_{m}$ may represent the time region where UGR m is active, implying that the m^(th) recording does not contain any audio information at time indexes lower than the minimum index or greater than the maximum index in set $\mathcal{U}_{m}$.

By combinatorially processing the available UGRs, it may be possible to construct an audio stream which has a longer duration than any of the UGRs individually. Without loss of generality, it may be assumed that an audio sequence produced as a result of combining all the available UGRs will extend from n=1 to n=N, corresponding to the earliest and latest moment of the event, respectively, captured in the M recordings. We also let c be the N×1 positive integer vector indicating the number of clips which are active at each time instant. In this scenario, 1≤c[n]≤M, ∀n holds.

Iterative normalization concerns a collection of M UGRs which belong to the same cluster and thus share common content with one another. This normalization has various advantages: it may ensure that all UGRs have equal significance in the mixing process, preventing, for example, recordings acquired at a small distance from the sound sources from masking those captured farther away. Also, accurate normalization may be important for constructing a mix without discontinuities and audible level transitions, which are expected to occur at the transition points where a clip starts or stops participating in the mixing process. In aspects, a specific energy profile p_(m)^((0)) for a UGR may be defined as a non-negative N×1 real vector carrying the energy of the m^(th) clip at each discrete time index (e.g., p_(m)^((0))=[p_(m)[1], . . . , p_(m)[n], . . . , p_(m)[N]]). For time indexes n which do not belong to the time support of clip m, a zero value may be assigned as follows:

$\begin{matrix}{p_{m}^{(0)}[n] = \begin{cases} \text{energy of the } m^{\text{th}} \text{ clip at time-point } n & \text{if } n \in \mathcal{U}_{m} \\ 0 & \text{otherwise} \end{cases}} & (5)\end{matrix}$

With respect to the example depicted in FIG. 2, the energy profile of the UGR associated with index m=2 may be all zeros for n=1 until point n=n₂^(start)−1, as well as for n=n₂^(end)+1 until n=N.

It may now be observed that the total energy of the sound signal in UGR m may be derived as $E_{m} = \sum_{n \in \mathcal{U}_{m}} p_{m}^{(0)}[n] = \sum_{n=1}^{N} p_{m}^{(0)}[n]$. An inverse clip cardinality vector C may be defined as the N×1 vector

$C = {\left\lbrack {\frac{1}{c\lbrack 1\rbrack},\frac{1}{c\lbrack 2\rbrack},\ldots,\frac{1}{c\lbrack N\rbrack}} \right\rbrack^{T}.}$

The goal of normalization may be to define a scalar μ_(m) with which to weight each original UGR, and the normalization may be implemented through the iterative process described below in Table 1.

TABLE 1
Algorithm for Iterative Normalization

Input: initial energy profiles p_(m)^((0)), ∀m
Input: inverse cardinality vector C
Input: number of iterations I
Output: normalization gains μ_(m), ∀m

for i = 1 to I do
    q^((i)) = Σ_(m=1)^(M) p_(m)^((i−1))
    P^((i)) = C · q^((i))   (· implies element-wise multiplication)
    for m = 1 to M do
        p_(m)^((i)) ← λ_(m)^((i)) p_(m)^((i−1)), where
        $\lambda_{m}^{(i)} = \frac{\sum_{n \in \mathcal{U}_{m}} P^{(i)}[n]}{\sum_{n \in \mathcal{U}_{m}} p_{m}^{(i-1)}[n]}$
    end for
end for
$\mu_{m} = \sqrt{\prod_{i = 1}^{I}\lambda_{m}^{(i)}}$
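By way of non-limiting illustration, the iterative normalization may be sketched as follows. The sketch reads the scaling factor of each iteration as the ratio between the averaged energy profile and the clip's own energy profile, both summed over the clip's time support; this reading, the function name `iterative_normalization`, and the 0-based inclusive `spans` argument are assumptions. It is further assumed that at least one clip is active at every grid point and that no clip has zero total energy.

```python
import math


def iterative_normalization(profiles, spans, n_iter=10):
    """Return one amplitude gain per clip.

    profiles[m][n]: energy of clip m at grid point n (0 outside its support).
    spans[m]:       (start, end) grid indexes of clip m, 0-based and inclusive.
    """
    M, N = len(profiles), len(profiles[0])
    # c[n]: number of clips active at grid point n; dividing by it applies C.
    c = [sum(1 for s, e in spans if s <= n <= e) for n in range(N)]
    p = [list(row) for row in profiles]
    gain_sq = [1.0] * M
    for _ in range(n_iter):
        q = [sum(p[m][n] for m in range(M)) for n in range(N)]
        P = [q[n] / c[n] for n in range(N)]      # averaged energy profile
        for m, (s, e) in enumerate(spans):
            lam = sum(P[s:e + 1]) / sum(p[m][s:e + 1])
            p[m] = [lam * v for v in p[m]]
            gain_sq[m] *= lam                    # running product of the lambdas
    # Energies scale with the squared amplitude, hence the square root.
    return [math.sqrt(g) for g in gain_sq]


# Two fully overlapping clips whose energies differ by a factor of four.
mu = iterative_normalization([[4, 4, 4, 4], [1, 1, 1, 1]], [(0, 3), (0, 3)])
```

In this example the louder clip is attenuated and the quieter clip is amplified until both carry the same energy after scaling.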

Letting now column vector x_(m) represent the sound signal in audio clip m, this signal can be replaced with its normalized version as:

$\begin{matrix}{{\hat{x}}_{m} = {\mu_{m}x_{m}}} & (6)\end{matrix}$

a process which is repeated for all m=1, . . . , M.

As noted above, the distance between consecutive points on the grid, say T_(n), might be much larger than the sampling period T_(s)=1/F_(s). The energy at point n for clip m can be calculated through the summation

${{p_{m}\lbrack n\rbrack} = {\sum_{k = 1}^{L}{\frac{1}{L}{x_{m}^{2}\left\lbrack {{\left( {n - 1} \right)L} + k} \right\rbrack}}}},$

where k is the signal sample index and L is the number of signal samples per grid interval.
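By way of non-limiting illustration, the energy at each grid point may be computed as the mean squared sample value over consecutive blocks of L samples. The function name `energy_profile` and the choice to discard a trailing partial block are assumptions of this sketch.

```python
def energy_profile(x, L):
    """Mean squared value of x over consecutive, non-overlapping blocks of L samples."""
    n_points = len(x) // L          # a trailing partial block is discarded (assumption)
    return [sum(v * v for v in x[n * L:(n + 1) * L]) / L for n in range(n_points)]


profile = energy_profile([1, 1, 2, 2, 3, 3], L=2)
```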

An alternative version of the previously described iterative process is the case where the algorithm doesn't stop after a fixed number of iterations I, but after a criterion is met. In particular, the algorithm may stop at iteration i when the following condition is met:

$\begin{matrix}{{|\lambda_{m}^{(i)} - 1|} \leq \epsilon,\ \forall m} & (7)\end{matrix}$

where ϵ<<1 is a predefined positive threshold.

Referring back to FIG. 2, each UGR may start and stop at arbitrary time instants, and the number of available UGRs at each time instant may vary. A constant time segment (referred to simply as a time segment hereafter) may be defined as a collection of consecutive time-points on a grid where the number of available UGRs is constant. As shown in the example illustrated in FIG. 2, the three UGRs define five constant time segments, indexed with l=1, . . . , 5. The time instants where the number of audio clips participating in the mix varies are called transition points. In the example of FIG. 2, it may be observed that the locations of the transition points are k₁=1, k₂=n₃^(start), k₃=n₂^(start), k₄=n₃^(end)+1, k₅=n₁^(end)+1, and k₆=n₂^(end)+1=N+1. To ensure that the mixture resulting from the combination of the M UGRs will not exhibit sudden level transitions when certain UGRs start or stop participating in the mix (e.g., when passing through transition points), the weights with which the different audio signals are summed in the mixing process may vary at each time segment l. This is the basic motivation behind the mixing approaches described with respect to the time domain below. In another aspect, also described below, the mixing weights may be subjected to variations along both time and frequency, thereby exploiting the additional degrees of freedom that a Time-Frequency domain mixing approach has to offer.
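By way of non-limiting illustration, the transition points and constant time segments may be derived from the start and end indexes of the UGRs as sketched below. The function name `constant_time_segments` and the numerical start and end values are hypothetical; the values are chosen only to reproduce the ordering of the three recordings described for FIG. 2.

```python
def constant_time_segments(spans):
    """spans[m-1] = (n_start, n_end) of UGR m, 1-based and inclusive.

    Returns the sorted transition points and, for each constant time segment,
    a tuple (first index, last index, list of active UGR numbers).
    """
    # Transition points: every start index and every end index plus one.
    points = sorted({s for s, _ in spans} | {e + 1 for _, e in spans})
    segments = []
    for k_lo, k_hi in zip(points, points[1:]):
        active = [m for m, (s, e) in enumerate(spans, start=1)
                  if s <= k_lo and k_hi - 1 <= e]
        segments.append((k_lo, k_hi - 1, active))
    return points, segments


# UGR 1 spans 1..8, UGR 2 spans 5..10 and UGR 3 spans 3..6, so that N = 10.
points, segments = constant_time_segments([(1, 8), (5, 10), (3, 6)])
```

With these values the sketch yields six transition points and five constant time segments, the last transition point being N+1.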

The mixing techniques presented below illustrate various operations for mixing media content at step 150. It is noted that the mixing may be performed without normalization of the UGRs or may be performed after the UGRs have been normalized using the normalization techniques described herein. However, it is noted that in aspects, alternative normalization procedures may be used.

Time Domain Mixing

Basic Segment-Wise Mixing

In line with the assumptions stated above, assume now that time index n refers to signal sample values, so that the spacing between consecutive points on the grid is equal to one sampling period T_(s)=1/F_(s). The augmented signal for UGR m may be represented as follows:

$\begin{matrix}{y_{m}[n] = \begin{cases} {\hat{x}}_{m}\left[n - n_{m}^{start} + 1\right] & \text{if } n \in \mathcal{U}_{m} \\ 0 & \text{otherwise} \end{cases}} & (8)\end{matrix}$

which extends from n=1 to n=N, to produce the N×1 augmented signal vector y_(m). Letting D_(l)={k_(l), k_(l)+1, . . . , k_(l+1)−1} denote the set of sample indexes which are in the range of the l^(th) constant time segment, the notation y_(m,l) may be used to refer to the portion of the m^(th) augmented signal vector which belongs to the l^(th) time segment. This vector may be constructed as follows:

$\begin{matrix}{y_{m,l} = {\left. y_{m}[n] \right|_{n \in D_{l}} = \left[y_{m}[k_{l}], \ldots, y_{m}[k_{l + 1} - 1]\right]^{T}}} & (9)\end{matrix}$

An augmented signal matrix representative of the l^(th) time segment in all M audio recordings may be constructed according to:

$\begin{matrix}{Y_{l} = \left[y_{1,l}, \ldots, y_{M,l}\right]} & (10)\end{matrix}$

Now, let V_(l) denote the set of active audio clip indexes at the l^(th) time segment, with |V_(l)| denoting the number of its members. At the l^(th) time segment, all of the available signals may be superimposed to produce a mixture s_(l) as follows:

$\begin{matrix}{s_{l} = {{Y_{l}w_{l}} = {\frac{1}{\sqrt{|V_{l}|}}Y_{l}1_{M \times 1}}}} & (11)\end{matrix}$

where 1_(M×1) is an M×1 column vector of ones and:

$\begin{matrix}{w_{l} = {\frac{1}{\sqrt{|V_{l}|}}1_{M \times 1}}} & (12)\end{matrix}$

is the so-called mixing weight vector particular to the l^(th) time segment.

This mixing process may be repeated for all time segments l=1, . . . , L and the final mixture may be derived by concatenating the mixture at each time segment as follows:

$\begin{matrix}{s = \left[s_{1}^{T}, \ldots, s_{L}^{T}\right]^{T}} & (13)\end{matrix}$
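By way of non-limiting illustration, the basic segment-wise mixing of Eqs. (8) through (13) may be sketched as follows. Because the number of active clips is constant within a time segment, scaling each output sample by the square root of the number of clips active at that sample yields the same result as mixing segment by segment and concatenating. The function name `mix_basic` and the list-based signal representation are assumptions.

```python
import math


def mix_basic(clips, spans):
    """Mix time-aligned clips with equal weights 1/sqrt(number of active clips).

    clips[m]: samples of (normalized) clip m.
    spans[m]: (n_start, n_end) of clip m on the common axis, 1-based and inclusive.
    """
    N = max(e for _, e in spans)
    mix = []
    for n in range(1, N + 1):
        active = [m for m, (s, e) in enumerate(spans) if s <= n <= e]
        # Eq. (8): sample n of the augmented signal is sample n - n_start + 1 of the clip.
        total = sum(clips[m][n - spans[m][0]] for m in active)
        mix.append(total / math.sqrt(len(active)) if active else 0.0)
    return mix


# Clip 0 covers samples 1..4, clip 1 covers samples 3..4.
s = mix_basic([[1.0, 1.0, 1.0, 1.0], [1.0, 1.0]], [(1, 4), (3, 4)])
```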

The fact that Eq. (11) may be weighted by the square root of the number of active UGRs implies that audio clips participating in the mixing process are assumed to be uncorrelated with one another. Admittedly, this assumption is not strictly valid, since if the different audio channels were completely uncorrelated with one another, it would not be possible to synchronize them in the first place. However, it is reasonable to assume that even when these channels are synchronized, the degree of correlation among them is rather small, so that a 3 dB increase in the mixture signal power should be anticipated every time that the number of participating UGRs is doubled. However, in the case that the different UGRs exhibit stronger correlations, superposition may have constructive or destructive effects. As this may result in unwanted level variations when passing through transition points, an approach for scaling the mixing weights so that this problem is avoided is described below.

Mixing with Target Power

To suppress audible level transitions between two consecutive constant time segments in the obtained mixture, an estimation of the target power $\tilde{q}_{l}$ at each time segment l may be used in order to scale the mixing weights derived in Eq. (12) so as to meet the target power. For example, using the weights w_(l) from Eq. (12), the signal power of the mix in the l^(th) time segment may be represented by:

$\begin{matrix}{q_{l} = {w_{l}^{T}Y_{l}^{T}Y_{l}w_{l}}} & (14)\end{matrix}$

A new weight vector may now be constructed based on the ratio between the target and the actual signal power, which may be expressed as follows:

$\begin{matrix}{{\overset{\sim}{w}}_{l} = {\frac{\sqrt{{\overset{\sim}{q}}_{l}}}{\sqrt{q_{l}}}w_{l}}} & (15)\end{matrix}$

One reasonable choice for the target power at the l^(th) time segment may be given by

$\begin{matrix}{{\overset{\sim}{q}}_{l} = {\frac{1}{|V_{l}|}{tr}\left\{ {Y_{l}^{T}Y_{l}} \right\}}} & (16)\end{matrix}$

where tr{⋅} denotes the trace of a matrix. In this case, the mixing weights may become:

$\begin{matrix}{{\overset{\sim}{w}}_{l} = {\frac{1}{\sqrt{|V_{l}|}}\sqrt{\frac{{tr}\left\{ {Y_{l}^{T}Y_{l}} \right\}}{1_{M \times 1}^{T}Y_{l}^{T}Y_{l}1_{M \times 1}}}1_{M \times 1}}} & (17)\end{matrix}$

Again, as in Eq. (11), the mixture at the l^(th) time segment may be derived from $s_{l} = Y_{l}\tilde{w}_{l}$ and the full mixture may be the result of concatenating all time segments, as in Eq. (13).
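By way of non-limiting illustration, the target-power scaling of Eqs. (14) through (16) may be sketched for a single constant time segment as follows. The function name `target_power_weight` is hypothetical, only the clips active in the segment are passed in, and the segment mixture is assumed to have non-zero power.

```python
import math


def target_power_weight(columns):
    """Return the common scalar weight applied to every active clip in one segment.

    columns: list of sample lists, one per clip active in the segment.
    """
    V = len(columns)
    w = 1.0 / math.sqrt(V)                                      # Eq. (12)
    mix = [w * sum(samples) for samples in zip(*columns)]
    q = sum(v * v for v in mix)                                 # Eq. (14)
    q_target = sum(v * v for col in columns for v in col) / V   # Eq. (16)
    return w * math.sqrt(q_target / q)                          # Eq. (15)


# Two identical (fully correlated) clips: the weight drops from 1/sqrt(2) to 1/2.
w_correlated = target_power_weight([[1, -1, 1, -1], [1, -1, 1, -1]])
# Two orthogonal clips: the weight stays at 1/sqrt(2).
w_orthogonal = target_power_weight([[1, 1, 1, 1], [1, -1, 1, -1]])
```

The first case shows the constructive superposition discussed above being compensated, while the second case leaves the basic weight unchanged because the uncorrelated assumption already holds.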

Time-Frequency Domain Mixing

As briefly described above, working in the time domain is simple and computationally efficient, but the mixing process may gain additional flexibility when implemented in the Time-Frequency (TF) domain. A Fast Fourier Transform (FFT) based overlap-add method may be used to transform the signal from the time domain to the TF domain, and an Inverse Fast Fourier Transform (IFFT) may then be used to transform it back to the time domain. In general, transformations from the time domain to the TF domain and backwards can be mathematically expressed as:

y _(m,l) [n]↔Y _(m,l)(τ,ω)  (18)

where τ denotes the time index, ω the frequency index, and Y_(m,l)(τ,ω) refers to the portion of the m^(th) augmented signal which is active at time segment l. A vector of size V_(l)×1, containing the TF signal portions only from the signals which are active in the l^(th) time segment, may then be defined as follows:

Y _(l)(τ,ω)=Y _(m,l)(τ,ω)|_(mϵV) _(l)   (19)

The general equation describing the mixing process for the l^(th) time segment in the TF domain may be expressed as:

S _(l)(τ,ω)=w _(l) ^(H)(τ,ω)Y _(l)(τ,ω)  (20)

where w_(l)(τ,ω) is the V_(l)×1 complex weight vector and (⋅)^(H) denotes Hermitian transposition. It is noted that the mixing weights w_(l)(τ,ω) may now be time-dependent, frequency-dependent, or both time- and frequency-dependent, providing considerable flexibility in affecting the outcome of the mixing process. Aspects of two different techniques for choosing what these weights should be at each TF point are described below.

Maximum Component Elimination (MCE)

The MCE technique allows the mixing weights associated with some UGRs to occasionally become zero (e.g., to completely remove the contribution of some TF components in the mixing process), which may provide an efficient technique for removing the interference from sound sources which are in the foreground of each recording location. This technique may be applicable at time segments where the number of active UGRs is greater than or equal to two (e.g., |V_(l)|≥2 holds). In aspects, at each time and frequency index, the audio signal portions are sorted in descending order with respect to their energies, and the most energetic component is then removed from the mix by assigning it a weight equal to zero. For example, let:

w _(l)(τ,ω)=[w _(1,l)(τ,ω), . . . ,w _(J,l)(τ,ω)]^(T)  (21)

be the weight vector (J=|V_(l)|), then w_(j,l) takes the same value for all j=1, . . . , J except for the UGR index corresponding to the maximum component, which may be expressed as:

$\begin{matrix}{{w_{j,l}\left( {\tau,\omega} \right)} = \left\{ \begin{matrix}{0,} & {{if}\mspace{14mu}\left| {Y_{j,l}\left( {\tau,\omega} \right)} \right| > \left| {Y_{i,l}\left( {\tau,\omega} \right)} \right|\;{\forall{i \neq j}}} \\{\frac{1}{\sqrt{\left| V_{l} \right| - 1}},} & {otherwise}\end{matrix} \right.} & (22)\end{matrix}$

An extension of this approach, in which not only the single most energetic component but the Q=2, 3, . . . most energetic components are removed from the mix, may also be utilized, in which case the constant value for all non-zero weights becomes

$\frac{1}{\sqrt{\left| V_{l} \right| - Q}}.$

Intuitively, this may allow for more foreground energy to be removed from the mix, possibly at the cost of more audible artifacts.

An extension of the previous approach may be utilized for the case in which the operation is not specific to a single frequency bin but to a subband region containing multiple consecutive frequency bins. In this case, the mixing weight may be specific to each subband index b and may be calculated as:

$\begin{matrix}{{w_{j,l}\left( {\tau,b} \right)} = \left\{ \begin{matrix}{0,} & {{if}\mspace{14mu}\left| {Z_{j,l}\left( {\tau,b} \right)} \right| > \left| {Z_{i,l}\left( {\tau,b} \right)} \right|\;{\forall{i \neq j}}} \\{\frac{1}{\sqrt{\left| V_{l} \right| - 1}},} & {otherwise}\end{matrix} \right.} & (23)\end{matrix}$

where Z_(j,l)(τ,b)=Σ_(ωϵS_(b)) |Y_(j,l)(τ,ω)|² and S_(b) is the set of frequency indexes belonging to the b^(th) subband region.

Minimum Variance Mixing (MVM)

Similar to the MCE technique described above, an MVM technique may also be used to suppress interference components which are unique to each recording location and, at the same time, to reveal the components which are common to the different recordings. For example, an augmented signal covariance matrix may be defined as:

Φ_(l)(τ,ω)=E{Y _(l)(τ,ω)Y _(l) ^(H)(τ,ω)}  (24)

where E{ } denotes expectation. The dimension of this matrix may be |V_(l)|×|V_(l)| and thus depends on the number of active UGRs at each constant time segment. The weights used for mixing may be derived as a solution to an optimization problem which involves minimization of the total signal power of the mixture signal subject to a linear equality constraint. In particular, the optimization problem may be written independently at each TF point and time segment l as:

Minimize w _(l) ^(H)(τ,ω)Φ_(l)(τ,ω)w _(l)(τ,ω) subject to d _(l) ^(H) w _(l)(τ,ω)=1  (25)

As a solution to this optimization problem, one may use the formula:

$\begin{matrix}{{w_{l}\left( {\tau,\omega} \right)} = \frac{\left\lbrack {{{\Phi_{l}\left( {\tau,\omega} \right)} +} \in I} \right\rbrack^{- 1}d_{l}}{{d_{l}^{H}\left\lbrack {{\Phi_{l}\left( {\tau,\omega} \right)} + I} \right\rbrack}^{- 1}d_{l}}} & (26)\end{matrix}$

where I is the |V_(l)|×|V_(l)| identity matrix and ε is a positive constant which may be defined by a user. It is noted that Eqs. (25) and (26) are well known as the cost function and the solution of the MVDR beamformer, which involves a completely different setting, since beamforming requires particular microphone arrangements and assumes that the locations of the microphones are known (when dealing with UGRs, neither of these conditions holds). This approach may be made applicable to the case of mixing the UGRs by choosing:

$\begin{matrix}{d_{l} = {\frac{1}{\sqrt{V_{l}}}I}} & (27)\end{matrix}$

where I in Eq. (27) denotes a |V_(l)|×1 vector of ones (rather than the identity matrix of Eq. (26)). The mixture at each TF point may then be calculated by inserting the outcome of Eq. (26) into Eq. (20).
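For purposes of illustration, rather than by way of limitation, the MVM weights at a single TF point may be computed as sketched below in Python with NumPy. The function name `mvm_weights`, the default value of the regularization constant, and the estimation of the covariance matrix by averaging over a set of frames are all assumptions of this sketch; the disclosure only specifies the covariance as an expectation.

```python
import numpy as np

def mvm_weights(Phi, eps=1e-3):
    """Minimum variance mixing weights for one TF point.

    Phi is the (V, V) covariance matrix; eps is the positive loading constant.
    """
    V = Phi.shape[0]
    d = np.ones(V) / np.sqrt(V)                      # constraint vector of ones
    z = np.linalg.solve(Phi + eps * np.eye(V), d)    # [Phi + eps*I]^{-1} d
    return z / (d.conj() @ z)                        # normalise so d^H w = 1

# Assumption: the expectation E{Y Y^H} is approximated by a frame average.
rng = np.random.default_rng(1)
frames = rng.standard_normal((200, 3)) + 1j * rng.standard_normal((200, 3))
Phi = frames.T @ frames.conj() / frames.shape[0]
w = mvm_weights(Phi)

# Mixture at this TF point for one observation vector y: S = w^H y.
y = frames[0]
S = np.vdot(w, y)
```

Solving the linear system with `numpy.linalg.solve` avoids forming the matrix inverse explicitly, which is a common numerical choice and not something required by the formula itself.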

Segment-Wise Phase Alignment

Per-segment phase alignment may be required for segments with two or more active clips, so that destructive interference is avoided during the mixing process. In aspects, this phase alignment step is optional and applies to all the mixing approaches presented so far. The phase of the m^(th) time-domain audio signal in the l^(th) segment may be preserved or reversed depending on the sign of its cross-correlation with the other clips which are active in the same time segment. This process may be written as:

y _(m,l) =q _(m,l) y _(m) [n]| _(nϵD) _(l)   (28)

where q_(m,l)=1 or −1, so that the following holds:

y _(i,l) ^(T) y _(j,l)>0∀i,jϵV _(l)  (29)

The y_(m,l) vector presented in Eq. (28) may be used to replace the corresponding ones in Eqs. (9) and (18), without further affecting the presented approaches.

It should be noted that the criterion presented in Eq. (29) is somewhat ambiguous, which can be seen from the fact that if y_(i,l) ^(T)y_(j,l)<0 holds for two clips indexed with i and j, either the phase of i or that of j may be reversed. A criterion for deciding which audio channel's phase should be reversed is that the same audio clip, when moving from one segment to the next, should undergo as few phase reversals as possible, so that discontinuities at the transition points are minimized and/or reduced.
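One possible, simplified way of choosing the signs q_(m,l) is sketched below in Python with NumPy. The function name `align_segment` is hypothetical. The sketch aligns every clip with the first active clip only, so it guarantees a non-negative inner product with that reference clip but not necessarily between every pair of clips; it then resolves the global sign ambiguity by keeping whichever of the two equivalent sign patterns reverses fewer clips relative to the previous segment. The previous signs are assumed to refer to the same clips in the same order.

```python
import numpy as np

def align_segment(Y, prev_signs=None):
    """Choose a sign (+1 or -1) per clip for one time segment.

    Y is an (N, V) real array with one column per active clip.
    """
    reference = Y[:, 0]
    # Reverse any clip whose inner product with the reference is negative.
    signs = np.where(Y.T @ reference < 0, -1, 1)
    if prev_signs is not None:
        prev_signs = np.asarray(prev_signs)
        # signs and -signs give the same pairwise products; keep the pattern
        # that changes fewer clips relative to the previous segment.
        if np.sum(signs != prev_signs) > np.sum(-signs != prev_signs):
            signs = -signs
    return signs

a = np.sin(np.linspace(0.0, 20.0, 500))
Y = np.stack([a, -a, 0.5 * a], axis=1)   # second clip is phase-reversed
signs = align_segment(Y)
Y_aligned = Y * signs                    # apply the per-clip signs
```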

As shown above, the method 100 provides various techniques for processing UGRs to produce a new audio file/content, where the UGRs may be captured by different users, at different locations of an event, and where the UGRs may capture different portions of the event with respect to time. Further, the method 100 may improve the quality of the final audio content relative to one or more of the individual UGRs used to generate the final audio content. For example, the final audio content may be a multichannel recording and may be generated in a manner that minimizes/reduces the noticeability of transition points (e.g., where one or more UGRs start and/or stop in the mixed final audio content). In aspects, an extension of the proposed mixing approaches to a reproduction system with multiple output audio channels may be conceptualized; UGRs may be grouped into separate clusters, each group may then be processed independently of the others, and the resulting audio streams may be panned to different directions. In aspects, methods such as stereophonic panning or Vector Base Amplitude Panning (VBAP) may be used.

Referring to FIG. 3, a flow diagram illustrating exemplary aspects of another process for generating media content by mixing a plurality of UGRs in accordance with embodiments of the present disclosure is shown as a method 300. Although not shown in FIG. 3 in order to simplify the drawing, the method 300 may include receiving a plurality of UGRs. The plurality of UGRs may include audio content, visual content (e.g., video content, image content, etc.), and/or audio content and visual content. After the plurality of UGRs are received, the method 300 may perform various operations or steps configured to utilize the plurality of UGRs to generate new media content.

It is noted that the specific notations utilized to represent the processing described with reference to FIG. 1 and FIG. 3, while appearing different, convey similar techniques and processing for mixing UGR media content, as will become apparent from the description that follows. For example, the method of FIG. 3 provides an approach for mixing UGR media content that may include identifying relationships between different UGRs using match strength techniques and time-offset estimation, as well as clustering, synchronization, and normalization techniques. Additionally, it is noted that, as compared to FIG. 1, the notations utilized to describe these similar functionalities with reference to FIG. 3 may represent functional implementations of the UGR processing techniques described herein that achieve improved results (e.g., enhanced or higher quality media content mixes) and performance (e.g., more efficient processing and analysis of UGR media content).

As shown in FIG. 3, the steps of the method 300 may include a step 310 where signal signature extraction is performed, a step 320 where match-strength and temporal offset processing is performed, a step 330 for performing hierarchical clustering and synchronization, a step 340 where normalization processing is performed, and a step 350 where UGRs are mixed to create new media content. The signal signature extraction step 310 may utilize a feature extraction process that accepts a UGR (e.g., a video recording or audio recording) as input and produces an output that provides a signature representative of the input UGR. The outputs (e.g., the signatures representative of the input UGRs) generated by the feature extraction process of the signal signature extraction step 310 may be utilized to determine a similarity between two signals (e.g., UGRs). The match-strength and temporal offset processing step 320 may be configured to assess the similarity between a pair of outputs (e.g., a pair of signal or UGR signatures) generated by the signal signature extraction step 310. In an aspect, features utilized to perform signature extraction may be extracted at regular time intervals, which may allow the match-strength and temporal offset step 320 to be applied to any pair of video or audio recordings regardless of what type of audio features and/or visual features are used.

Based on the pairwise relationships determined for a collection of UGRs, the hierarchical clustering and synchronization step 330 may separate the plurality of UGRs into clusters or groups and synchronize the members (e.g., the UGRs) of each cluster along the same time-axis. This may facilitate synchronous playback of the UGRs irrespective of whether the UGRs include audio content or visual content, such as video content, and without concern as to whether the UGRs are received as different file types, such as audio files (e.g., .mp3 files, .wav files, .mp4 files, etc.) or video files (e.g., .mov files, .avi files, .wmv files, etc.). The normalization step 340 may be configured to modify signal levels of the UGRs, which may significantly improve the sound quality during playback of the UGRs and/or the sound quality of new media content generated from the UGRs, where playback and/or generation of the new media content may be facilitated by the mixing step 350. It is noted that the mixing step 350 may utilize various different audio mixing techniques, each of which may be applicable to any type of content (audio or video).

Additional exemplary aspects of the steps 310-350 of the method 300 are described in more detail below. It is noted that the contributions and improvements realized by each of the steps 310-350 may be realized independent of the other steps and that each of the steps 310-350 does not depend on the specific approach utilized in the previous step(s). For example, the hierarchical clustering step 330 does not depend on the type of similarity measure used to assess the relationships between UGRs during the match-strength and temporal offset processing step 320, and each of the various mixing approaches described in connection with the mixing step 350 may be utilized to generate new media content from a plurality of UGRs irrespective of what approach was used for the hierarchical clustering and synchronization step 330 or the normalization step 340, if any such approach was used.

As explained above, a pairwise match strength and time-offset estimation may be utilized to assess the similarity between a pair of outputs (e.g., a pair of signal or UGR signatures) generated by the signal signature extraction step 310. An exemplary process for generating signal signatures suitable for supporting pairwise match strength and time-offset estimation in accordance with the present disclosure will now be described.

As explained above, a feature extraction process (e.g., step 310 of FIG. 3) may accept a UGR (e.g., a video recording or an audio recording) as input and produce an output that provides a representative signature of the input UGR. For the ith video or audio recording, the output of the feature extraction step 310 may include data in the form of an N_(i)×B binary matrix {tilde over (F)}_(i), where N_(i) is related to the duration of the recording, and B depends on the dimensionality of the signal signature and may be any integer greater than or equal to 1. This matrix may be obtained using multiple types of features, including acoustic features and/or visual features. In an aspect, this matrix may be constructed by extracting features at regular time intervals, which, as explained above, may allow the match-strength and temporal offset step 320 to be applied to any pair of UGRs regardless of what type of audio features and/or visual features were used. Saying that matrix {tilde over (F)}_(i) is binary means that each element of {tilde over (F)}_(i) can have only one of two different numeric or symbolic values. In that case, it can be transformed to a new matrix F_(i) such that each element of this new matrix is either equal to 1 or equal to −1. For example, in case each element of {tilde over (F)}_(i) takes a value in the set {0,1}, a new matrix F_(i) can be constructed from {tilde over (F)}_(i) by converting the value of all elements equal to 0 to the value of −1. In case the elements of {tilde over (F)}_(i) are symbolic, matrix F_(i) can be constructed from {tilde over (F)}_(i) by specifying an arbitrary association between the two symbols and the numeric values of 1 and −1.

The match-strength and temporal offset step 320 may receive two signal signatures F_(i) and F_(j) of size N_(i)×B and N_(j)×B respectively, and provide the two signal signatures to an unbiased similarity estimator configured to assess the similarity r_(ij) between the two signals. An exemplary unbiased similarity estimator, designed based on a modification of the classical cross-correlation operation involving two finite-length discrete sequences, is described in more detail below.

The generalized cross-correlation between two finite length sequences x and y of length N_(x)×1 and N_(y)×1 respectively may be defined as follows:

$\begin{matrix}{{R_{x,y}(\tau)} = \left\{ {\begin{matrix}{{\frac{1}{\sqrt{u(\tau)}}{\sum\limits_{n = 1}^{u{(\tau)}}{{x\left( {n + {\tau }} \right)}{y(n)}}}},} & {{{- N_{x}} + 1} < \tau < 0} \\{{\frac{1}{\sqrt{u(\tau)}}{\sum\limits_{n = 1}^{u{(\tau)}}{{x(n)}{y\left( {n + \tau} \right)}}}},} & {0 < \tau < {N_{y} - 1}}\end{matrix},} \right.} & (30)\end{matrix}$

where u(τ) is defined as

$\begin{matrix}{{u(\tau)} = \left\{ \begin{matrix}{{\min \left\{ {N_{x},N_{y},{N_{x} - {\tau }}} \right\}},} & {{{- N_{x}} + 1} < \tau \leq 0} \\{{\min \left\{ {N_{x},N_{y},{N_{y} - \tau}} \right\}},} & {0 < \tau < {N_{y} - 1}}\end{matrix} \right.} & (31)\end{matrix}$

and min{ } returns the minimum number in a set of numbers. This cross-correlation scheme produces N_(x)+N_(y)−1 values, defined at the values of the time lag for which the overlap between the two sequences is at least 1 sample. It can be observed that the value at each lag is normalized with respect to the square root of the number of summation terms (√{square root over (u(τ))}) on the right side of Eq. (30). This type of weighting is designed to remove the bias due to the fact that, at different values of the time lag τ, the amount of available information varies. A modified version of Eq. (30) can be derived by taking into account μ_(x) and μ_(y), i.e., the mean values of sequences x and y. In this case, Eq. (30) can be rewritten as:

$\begin{matrix}{{R_{x,y}(\tau)} = \left\{ \begin{matrix}{{\frac{1}{\sqrt{u(\tau)}}{\sum\limits_{n = 1}^{u{(\tau)}}{\left( {{x\left( {n + {\tau }} \right)} - \mu_{x}} \right)\left( {{y(n)} - \mu_{y}} \right)}}},} & {{{- N_{x}} + 1} < \tau < 0} \\{{\frac{1}{\sqrt{u(\tau)}}{\sum\limits_{n = 1}^{u{(\tau)}}{\left( {{x(n)} - \mu_{x}} \right)\left( {{y\left( {n + \tau} \right)} - \mu_{y}} \right)}}},} & {0 < \tau < {N_{y} - 1}}\end{matrix} \right.} & (32)\end{matrix}$

where u(τ) is defined as before from Eq. (31).

In several cases, it may be beneficial to calculate the cross-correlation between two sequences in the frequency domain, using for example a Fourier transform. This may significantly reduce the computational complexity of the operations. In this case, the proposed cross-correlation scheme can be obtained as follows:

1. A program may be used to obtain the biased cross-correlation C_(x,y)(τ) between two sequences x and y. Let c be the sequence with the cross-correlation values returned by the program.

2. Weight each term in c with

$\frac{1}{\sqrt{u(\tau)}},$

where u(τ) is defined as a function of the time lag according to Eq. (31).

It is to be noted that some routines may require that the input sequences x and y are of equal length. In this case, the shorter of the two sequences, say x, may be padded with zeros after the last sample so that its length becomes equal to N_(y). The aforementioned method may be applied as is to the two sequences {circumflex over (x)} and y, where {circumflex over (x)} is now produced from x by adding N_(y)−N_(x) zeros after the last sample.
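The two-step procedure above (biased cross-correlation followed by weighting with the inverse square root of the overlap length) may, for example, be sketched in Python with NumPy as follows. The function name `unbiased_xcorr` and the `remove_mean` flag are hypothetical. The sketch uses `numpy.correlate` in the time domain for the biased correlation rather than a Fourier transform, and it evaluates every lag with an overlap of at least one sample, so that N_(x)+N_(y)−1 values are returned.

```python
import numpy as np

def unbiased_xcorr(x, y, remove_mean=False):
    """Cross-correlation normalised by the square root of the overlap length.

    Returns (lags, values), where values[k] corresponds to the time lag lags[k].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if remove_mean:
        # Mean-removed variant: subtract the mean of each sequence first.
        x = x - x.mean()
        y = y - y.mean()
    n_x, n_y = len(x), len(y)
    lags = np.arange(-n_x + 1, n_y)          # N_x + N_y - 1 lag values
    # Step 1: biased cross-correlation, c[k] = sum_n x[n] * y[n + lags[k]].
    c = np.correlate(y, x, mode="full")
    # Step 2: overlap length u(tau), then weight each term by 1/sqrt(u).
    u = np.where(lags <= 0,
                 np.minimum(n_y, n_x - np.abs(lags)),
                 np.minimum(n_x, n_y - lags))
    return lags, c / np.sqrt(u)

# x appears inside y starting two samples later.
x = [1, -1, -1, 1, 1]
y = [1, 1, 1, -1, -1, 1, 1, -1]
lags, r = unbiased_xcorr(x, y)
best_lag = lags[np.argmax(r)]
```

In this example the maximum is found at a lag of two samples, where all five samples of x coincide with the matching portion of y.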

The above-described exemplary unbiased similarity estimator may be utilized to estimate a match strength between a pair of UGRs. For example, let F_(i) ^(b) denote the bth column of the signal signature matrix F_(i). The cross-correlation between two signal signature matrices F_(i) and F_(j) is obtained from

$\begin{matrix}{{{R_{F_{i},F_{j}}(\tau)} = {\sum\limits_{b = 1}^{B}{R_{F_{i}^{b}F_{j}^{b}}(\tau)}}},} & (33)\end{matrix}$

where

R_(F_(i)^(b)F_(j)^(b))(τ)

denotes the modified cross-correlation operation defined above. Note that if a zero-padding approach is required for implementing the cross-correlation function, the signal signature vectors F_(i) ^(b) may also contain zeros in addition to the values of 1 and −1. In an aspect, a maximum value of the cross-correlation, which may be expressed as:

r _(ij)=max_(τ) R _(F_(i),F_(j))(τ)  (34)

may be used as an estimate of the match strength or similarity between the two UGRs (e.g., recordings i and j), where max { } returns the maximum value from a set of numbers. An additional outcome of this process is the time-frame difference, which may be defined as:

$\begin{matrix}{\tau_{ij} = {\arg \; {\max\limits_{\tau}{R_{F_{i},F_{j}}(\tau)}}}} & (35)\end{matrix}$

which potentially defines the temporal offset that is required for synchronizing recordings i and j.

The above-described process may be applied for all pairwise combinations {i, j} such that j>i, and the corresponding similarity measures (or match strengths) and temporal offsets can be stored in M×M matrices {tilde over (R)} and {tilde over (T)} respectively. These matrices may be populated only above the main diagonal, and all other values may be initially set equal to zero. From these half-populated matrices one may easily produce fully populated matrices using the operations R={tilde over (R)}+{tilde over (R)}^(T) and T={tilde over (T)}−{tilde over (T)}^(T). The minus sign in ({tilde over (T)}−{tilde over (T)}^(T)) is explained by the fact that if τ_(ij) is the temporal offset that synchronizes recording j with recording i, then τ_(ji)=−τ_(ij) should hold.
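A minimal sketch of populating the two matrices is given below in Python with NumPy. The function name `full_matrices` and the dictionary layout used to hold the pairwise results are assumptions for illustration, and the indices are zero-based.

```python
import numpy as np

def full_matrices(pairs, M):
    """Build the full match-strength and temporal-offset matrices.

    pairs maps (i, j) with j > i to a tuple (r_ij, tau_ij).
    """
    R_half = np.zeros((M, M))
    T_half = np.zeros((M, M))
    for (i, j), (r, tau) in pairs.items():
        R_half[i, j] = r        # populated only above the main diagonal
        T_half[i, j] = tau
    R = R_half + R_half.T       # match strength is symmetric
    T = T_half - T_half.T       # offsets change sign when the pair is swapped
    return R, T

pairs = {(0, 1): (0.9, 5), (0, 2): (0.2, -3), (1, 2): (0.7, -8)}
R, T = full_matrices(pairs, M=3)
```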

As explained above, the hierarchical clustering and synchronization step 330 may be configured to identify relationships among the plurality of UGRs and group the UGRs identified as sharing or having relationships with each other into one or more clusters. Consider a collection of M UGRs indexed by m=1, . . . , M. Let the M×M matrix R denote the matrix containing all pairwise match strengths between UGRs, obtained using, for example, the techniques described above. Letting R_(ij) denote the element at the ith row and jth column of R, a distance matrix D, which may be another M×M matrix, may be constructed such that

$\begin{matrix}{D_{ij} = \frac{1}{R_{ij}}} & (36)\end{matrix}$

Matrix D may be a symmetric matrix which is informative about all pairwise distances between UGRs. An additional M×M matrix T is assumed to be available, such that τ_(ij), the element in the ith row and jth column of T, denotes the number of audio samples (or time-frames) that the ith recording must be delayed with respect to the jth recording in order to have the two recordings synchronized along the same temporal axis. This matrix is referred to as the temporal displacement matrix from now on. Several approaches to calculate these temporal displacements exist; however, as explained above, the synchronization process disclosed herein is not specific to how T is calculated. In an aspect, the temporal information provided by the temporal displacement matrix T and the information provided by the distance matrix D may be utilized to group different UGRs into clusters and to synchronize UGRs within the same cluster (if there are more than two), as described in more detail below.

(1.) Assuming each UGR represents an observation and that D_(ij), i≠j is the distance between observations i and j, an agglomerative hierarchical cluster tree linking all M UGRs may be generated.

(2.) The number of clusters and the identities of the UGRs in each cluster may be estimated by setting a threshold D_(max) and requiring that the distance that a UGR exhibits before entering a particular cluster be smaller than D_(max). This may produce a minimum of one and a maximum of M clusters, and the number of UGRs in each cluster will range between 1 and M.

(3.) A graph consisting of M nodes may be constructed so that the m^(th) node corresponds to the m^(th) UGR and so that initially, no node is connected to another node. Based on the results of hierarchical clustering, as obtained from step (1.) above, connect only those nodes which are in the same cluster, as obtained from step (2.) above.

(4.) For all clusters with a number of members L>1, the linkage will result in a tree. The UGRs in each tree may be synchronized as follows. First, choose any UGR in the tree as a reference UGR. Now, let p(i→r) denote the path connecting node i with the reference node denoted with r. Obtain the temporal displacement Q_(ir) that synchronizes UGR i with UGR r by summing together all time offsets specific to the edges that form the path from i to r. For example, if r=1 and i=8, and the path connecting UGR 8 with UGR 1 is the set {8, 3, 4, 1}, then Q_(8,1)=τ_(8,3)+τ_(3,4)+τ_(4,1), with τ_(ij) extracted directly from the temporal displacement matrix T. This process may be repeated until all nodes in a particular cluster are synchronized with the reference node r.

(5.) From the previous process, a set of L−1 temporal offsets is produced. The hierarchical synchronization process may be finalized as follows. If all L−1 temporal offsets are equal to or greater than zero, then the UGR with index r is the earliest starting UGR, i.e., the recording that was initiated earlier than any other recording in the particular cluster. For example, and referring to FIG. 4, a diagram illustrating aspects of extracting information from connected UGRs is shown. As shown in FIG. 4, the three connected UGRs include a first UGR 410, a second UGR 420, and a third UGR 430. The UGRs 410, 420, 430 may be “connected” in the sense that they include audio content and/or video content that overlaps with respect to time, which may result in the UGRs 410, 420, 430 being grouped into a single cluster. Because the first UGR 410 starts “Y” units of time after the start of the second UGR 420 and the third UGR 430 starts “X” units of time after the start of the second UGR 420, the second UGR 420 is the earliest starting UGR (n₂ ^(start)=0). Referring back to FIG. 3, if not all L−1 temporal offsets are equal to or greater than zero (e.g., if there are negative temporal offsets), find the UGR index in the cluster returning the most negative value:

$\begin{matrix}{k = {\arg \; {\min\limits_{i}Q_{ir}}}} & (37)\end{matrix}$

Then, repeat the process of the previous step by setting the UGR with index k as the reference UGR. For the UGR with index k, set n_(k) ^(start)=0. For all other UGRs in the same cluster, set n_(i) ^(start)=Q_(ik). The values of n_(i) ^(start) will be non-negative and denote the sample (or time-frame) delay value required for synchronizing each UGR with respect to the UGR with index k, which is the earliest starting UGR in the given cluster.
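Steps (4.) and (5.) may be illustrated with the following Python sketch, which assumes that the clustering of steps (1.) to (3.) has already produced the members and tree edges of one cluster. The function name `start_offsets` and its argument layout are hypothetical. Instead of literally repeating the path summation with the earliest UGR as the new reference, the sketch subtracts the most negative offset from all offsets, which gives the same result provided that the temporal displacement matrix is antisymmetric (τ_(ji)=−τ_(ij)).

```python
from collections import deque

def start_offsets(members, edges, T, ref=None):
    """Return {ugr_index: start delay} for one cluster.

    members: UGR indices in the cluster; edges: (a, b) pairs forming a tree;
    T[a][b]: delay of recording a with respect to recording b.
    """
    neighbours = {m: [] for m in members}
    for a, b in edges:
        neighbours[a].append(b)
        neighbours[b].append(a)
    ref = members[0] if ref is None else ref
    # Breadth-first walk from the reference: Q[i] sums the offsets on the
    # path from node i to the reference node.
    Q = {ref: 0}
    queue = deque([ref])
    while queue:
        node = queue.popleft()
        for nb in neighbours[node]:
            if nb not in Q:
                Q[nb] = T[nb][node] + Q[node]
                queue.append(nb)
    earliest = min(Q, key=Q.get)          # index with the most negative offset
    # Re-reference all offsets to the earliest starting UGR (assumes T is
    # antisymmetric, so that Q_ik = Q_ir - Q_kr).
    return {i: Q[i] - Q[earliest] for i in members}

# UGR 1 starts first; UGR 0 starts 40 samples later and UGR 2 starts 100 later.
T = [[0, 40, -60],
     [-40, 0, -100],
     [60, 100, 0]]
offsets = start_offsets([0, 1, 2], [(0, 1), (1, 2)], T)
```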

The above-described clustering and synchronization techniques, which have been provided for purposes of illustration, rather than by way of limitation, may be used to generate synchronization data. In an aspect, the synchronization data may be utilized to facilitate synchronized playback of the associated UGRs. In an additional aspect, the synchronization data may be utilized by additional processes, such as the mixing step 350, described in more detail below.

As explained above, the method 300 may include a normalization step 340. It is noted that the exemplary normalization processes described below may be applied to so-called connected recordings, e.g., recordings that belong to the same cluster. The normalization process has various advantages. First, it ensures that all audio clips have equal significance in the mixing process, preventing, for example, recordings acquired at a small distance from the sound sources from masking those captured farther away. Also, accurate normalization is important for constructing a mix without discontinuities and audible level transitions, which are expected to occur at a transition point where a particular UGR within the cluster starts or stops participating in the mixing process.

Consider a collection of M≥2 connected UGRs. Assume also that all UGRs are available in pulse code modulation (PCM) format and let x_(m)[n] denote the value of the nth sample of the mth UGR. The normalization process may define a set of M normalization gains g=[g₁, . . . , g_(M)]^(T) to scale all recordings according to {circumflex over (x)}_(m)[n]=g_(m)x_(m)[n], ∀m.

Without loss of generality, an illustrative example of the various notations based on three UGRs will now be described. Referring briefly to FIG. 5, a diagram illustrating additional aspects of extracting information from connected UGRs is shown. As shown in FIG. 5, the three mutually overlapping UGRs include a first UGR 510, a second UGR 520, and a third UGR 530. Each of the UGRs 510, 520, 530 may include a recording of audio content, and each of the UGRs 510, 520, 530 may have been captured at different distances from a source of the audio content. As shown in FIG. 5, the three UGRs 510, 520, 530 may define six transition points n and five time segments j. The points in time corresponding to the beginning or ending of each UGR define the so-called transition points n, and j is used to index a time segment extending between two consecutive transition points. In FIG. 5, p_(m,j) denotes the energy of the unscaled mth UGR in the jth segment, and c_(j), a positive integer, denotes the number of UGRs which are active in the jth segment. A UGR is considered to be active or inactive depending on whether the recording process was ongoing at the particular point in time or not. The notation (m) is used to denote the set of segment indexes which fall within the range of the mth UGR. If j is not an element of the set (m), then p_(m,j)=0.

Referring back to FIG. 3, an exemplary process that may be followed during the normalization step 340 to normalize the UGRs is illustrated in Table 2 below.

TABLE 2 Algorithm for Iterative Normalization

Input: initial energy profiles p_(m) ⁽⁰⁾, ∀m
Input: number of iterations I
Output: normalization gains g_(m), ∀m

for i = 1 to I do
  for k = 1 to M do
    $\lambda_{k}^{(i)} = \frac{\sum_{j \in {{(k)}}}{\frac{1}{c_{j}}{\sum_{m = 1}^{M}p_{m,j}^{({i - 1})}}}}{\sum_{j \in {{(k)}}}p_{k,j}^{({i - 1})}}$
    p_(k) ^((i)) ← λ_(k) ^((i))p_(k) ^((i-1))
  end for
end for
$g_{k} = \sqrt{\prod_{i = 1}^{I}\lambda_{k}^{(i)}},\;\forall k = 1,\ldots,M$

Letting now x_(m)[n] represent the sound signal in audio clip m, this signal can be replaced with its normalized version, which may be expressed as:

{circumflex over (x)} _(m) [n]=g _(m) x _(m) [n]  (38)

This process may be repeated for all audio clips m=1, . . . , M.

In addition to the previously described iterative process, another normalization technique that uses an algorithm that does not stop after a fixed number of iterations I may be used in accordance with aspects of the present disclosure. This algorithm may be configured to stop after a stop criterion is met. For example, one such criterion may be the following: stop at iteration i when the following condition is met:

|λ_(m) ^((i))−1|≤ϵ,∀m  (39)

where ϵ<<1 is a predefined positive threshold.
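The iterative normalization, including the threshold-based stop criterion, may be sketched in Python with NumPy as follows. The function name `normalization_gains` and the `max_iter` safeguard are assumptions of this sketch. It further assumes that a UGR is active in exactly those segments where its energy is non-zero, that every segment has at least one active UGR, and that all scale factors of an iteration are computed from the energies of the previous iteration before any energy profile is updated.

```python
import numpy as np

def normalization_gains(P, eps=1e-6, max_iter=100):
    """Iteratively equalise clip energies over the segments they share.

    P is an (M, J) array with P[m, j] the energy of UGR m in segment j,
    and zero where UGR m is inactive. Returns the M amplitude gains.
    """
    P = np.array(P, dtype=float)
    active = P > 0                      # assumption: zero energy means inactive
    c = active.sum(axis=0)              # number of active UGRs per segment
    energy_scale = np.ones(P.shape[0])  # running product of the lambdas
    for _ in range(max_iter):
        mean_energy = P.sum(axis=0) / c                  # per-segment average
        lam = (active * mean_energy).sum(axis=1) / P.sum(axis=1)
        P *= lam[:, None]               # update every energy profile
        energy_scale *= lam
        if np.all(np.abs(lam - 1.0) <= eps):             # stop criterion
            break
    return np.sqrt(energy_scale)        # energy factors -> amplitude gains

# Two fully overlapping clips whose energies differ by a factor of four.
gains = normalization_gains([[4.0], [1.0]])
```

In this simple case the procedure converges after one update, and both clips end up with the same energy after scaling.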

As described above, the mixing step 350 may be configured to combine the synchronized UGRs in order to produce a monophonic mixture that provides a better representation of the acoustic event. The mixing step 350 may utilize various mixing techniques configured in accordance with aspects of the present disclosure, as described in more detail below.

As explained above, the mixing step 350 may utilize a time-domain mixing technique to combine the synchronized UGRs. Consider a collection of M≥2 recordings forming a connected graph. Assume also that all recordings are available in PCM format and let x_(m)[n] denote the value of the nth sample of the mth recording. All recordings are assumed to be available at the same sampling rate F_(s). For each recording m=1, . . . , M, it is assumed that there is prior knowledge of n_(m) ^(start), which denotes the sample delay value of each recording with respect to the earliest starting UGR, as described above with respect to FIG. 4. This synchronization information may be used to synchronize all UGRs along the same temporal axis such that there is a one-to-one correspondence between sample index n and time t through t=(n−1)/F_(s).

The set (n) may be defined as the set of UGR indexes that are active at time t=(n−1)/F_(s), and L_(n) may denote the cardinality of that set, where 1≤L_(n)≤M. This information may, for example, be obtained by applying the hierarchical clustering and synchronization process described above.

The time-domain mixer may be implemented through:

$\begin{matrix}{{s\lbrack n\rbrack} = {\frac{1}{\sqrt{L_{n}}}{\sum_{m \in {{(n)}}}{x_{m}\left\lbrack {n - n_{m}^{start} + 1} \right\rbrack}}}} & (40)\end{matrix}$

where mϵ

(n) returns the UGR indexes that were active at time n. Observe that atthe right hand side of Eq. (11), the mixture is weighted by the inverseof the square root of the number of summation terms.
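The time-domain mixer of Eq. (40) can be sketched as below. The sketch assumes zero-based offsets (`starts[m]` is the number of samples by which recording m begins after the earliest recording), so the 1-based index shift of Eq. (40) does not appear explicitly. The function name and the handling of samples with no active recording are illustrative assumptions.

```python
import math


def mix_time_domain(recordings, starts):
    """Mix synchronized recordings following the structure of Eq. (40).

    recordings: list of sample lists x_m (same sampling rate assumed).
    starts: zero-based start offset of each recording on the common axis
            (an assumed convention; the text uses 1-based sample indexes).
    """
    total = max(s + len(x) for x, s in zip(recordings, starts))
    summed = [0.0] * total
    active = [0] * total  # L_n: number of recordings active at sample n
    for x, s in zip(recordings, starts):
        for k, value in enumerate(x):
            summed[s + k] += value
            active[s + k] += 1
    # Scale each sample by 1/sqrt(L_n); output 0.0 where nothing is active
    # (this should not occur for a connected set of recordings).
    return [v / math.sqrt(c) if c else 0.0 for v, c in zip(summed, active)]
```

With `recordings = [[1, 1, 1], [1, 1]]` and `starts = [0, 2]`, the two recordings overlap at one sample only, and that sample is the sum of both inputs scaled by 1/√2.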

An additional version of the time-domain mixer can be realized by working not with the original UGRs, but with their normalized versions x̂_(m)[n], for m=1, . . . , M, as may be derived from the normalization step 340 described above. In this case, the mixing process can be expressed as:

$\begin{matrix}{{s\lbrack n\rbrack} = {\frac{1}{\sqrt{L_{n}}}{\sum\limits_{m \in \mathcal{T}(n)}{{\hat{x}}_{m}\left\lbrack {n - n_{m}^{start} + 1} \right\rbrack}}}} & (41)\end{matrix}$

where, as in Eq. (40), the summation runs over the set 𝒯(n) of UGR indexes that are active at sample index n.

As shown above, the mixing step 350 may be configured to mix a plurality of UGRs based on the original UGRs (e.g., without normalization) or based on normalized versions of the UGRs, as may be derived from the normalization step 340. Thus, the mixing step 350 is not dependent upon the normalization step 340, which may be skipped or omitted if desired. However, as explained above, the normalization step 340 may provide improvements to the overall sound quality achieved by the mixing step 350, which may be advantageous for some applications.

In addition to utilizing time-domain mixing techniques, the mixing step 350 may also be configured to implement time-frequency-domain mixing. Working in the time domain is simple and computationally efficient, but additional flexibility with respect to the mixing process may be realized when the mixing step 350 is implemented in the Time-Frequency (TF) domain. A Fast Fourier Transform (FFT) may be used to transform the signal from the time domain to the TF domain and then an Inverse Fast Fourier Transform (IFFT) may be used to transform the signal back to the time domain. In general, transformations from the time domain to the TF domain and backwards can be mathematically expressed as:

x _(i) [n]↔X _(i)(τ,ω)  (42)

where τ denotes the time index and ω the frequency index. Here X_(i)(τ,ω) defines a complex signal portion that is specific to a TF point, representing the smallest piece of information that can be manipulated for constructing the final mixture. In the general case, larger signal portions can be considered by using a partitioning of the spectrum into multiple subbands, so that each frequency subband contains multiple successive Fourier coefficients. To illustrate, let Ω(o) denote the set of frequency indexes in the oth subband. Vector X_(i)(τ, o) represents the signal portion from the ith UGR containing all the complex Fourier coefficients in the oth subband:

X _(i)(τ,o)=[X _(i)(τ,ω)]_(ωϵΩ(o))  (43)

where [⋅]_(ωϵΩ(o)) denotes vertical concatenation over the frequency indexes ω in Ω(o).
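The subband grouping of Eq. (43) can be illustrated with a short sketch. It assumes that the subbands Ω(o) are contiguous ranges of FFT bins described by a list of bin edges; the function name and the `band_edges` parameter are hypothetical and chosen only for illustration.

```python
def split_into_subbands(frame, band_edges):
    """Group the Fourier coefficients of one time frame into subband vectors.

    frame: list of complex coefficients X_i(tau, omega) for a fixed tau.
    band_edges: increasing bin indexes; subband o covers bins
                band_edges[o] .. band_edges[o + 1] - 1 (assumed layout).
    Returns one list of coefficients per subband, i.e. X_i(tau, o).
    """
    return [frame[lo:hi] for lo, hi in zip(band_edges[:-1], band_edges[1:])]
```

Choosing `band_edges = [0, 1, 2, ...]` yields one bin per subband, which corresponds to the single-frequency-bin case mentioned in the text.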

Mixing techniques intended for use with UGRs may need to account for a temporally varying number of input channels. Different UGRs may span different temporal portions of the event and therefore the number of components available may vary with time frame τ. Further to this, one may decide to disregard certain subband indexes (or subband channels) from certain UGRs, due to average energy criteria, for example. Now assume a general channel selection process that operates on a given dataset of M connected UGRs to return at each time and subband index, τ and o respectively, a set of active UGR indexes 𝒯(τ, o) such that the set 𝒯(τ, o) is not empty, and let L_(τ,o) be a positive integer denoting the cardinality of set 𝒯(τ, o), where 1≤L_(τ,o)≤M and there is a one-to-one correspondence between available UGR portions indexed with i and selected UGR portions indexed with l. In what follows, the notation X̂_(l)(τ, o), l=1, . . . , L_(τ,o), is used to refer to the different components from the L_(τ,o) selected UGR portions.

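A general equation for a mixing process in the TF domain may be written as:

$\begin{matrix}{{S\left( {\tau,o} \right)} = {\sum\limits_{l = 1}^{L_{\tau,o}}{{w_{l}\left( {\tau,o} \right)}{{\hat{X}}_{l}\left( {\tau,o} \right)}}}} & (44)\end{matrix}$

where w_(l)(τ,o) denotes the mixing weight applied to the lth selected UGR portion.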

The simplest approach for mixing the selected channels is to consider equal weights, in which case the output signal can be derived from:

$\begin{matrix}{{{S\left( {\tau,o} \right)} = {\frac{1}{\sqrt{L_{\tau,o}}}{\sum_{i \in {{({\tau,o})}}}{X_{i}\left( {\tau,o} \right)}}}},} & (45) \\{\mspace{70mu} {{= {\frac{1}{\sqrt{L_{\tau,o}}}{\sum\limits_{l}^{L_{\tau,o}}{{\hat{X}}_{l}\left( {\tau,o} \right)}}}},}} & (46)\end{matrix}$

Similar to the time-domain mixing technique described above,

$\frac{1}{\sqrt{L_{\tau,o}}}$

may be used to balance the appearance and disappearance of certain input channels in the mix, based on the assumption of independence of the UGRs. Note that using S(τ,o), the final time-domain signal can be obtained using the inverse Fourier transform. It is noted that the time-frequency domain approaches presented above may also be applied to cases where each subband region consists of exactly one frequency bin.
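The equal-weight mixing of Eqs. (45) and (46) can be sketched for a single time and subband index as follows. The function name is illustrative, and the input is assumed to be the list of already selected subband vectors X̂_l(τ,o), all of the same length.

```python
import math


def mix_equal_weights(portions):
    """Equal-weight TF mixing for one (tau, o) point, per Eq. (46).

    portions: list of L selected subband vectors, each a list of complex
              Fourier coefficients of identical length (assumed input form).
    Returns the mixed subband vector S(tau, o).
    """
    count = len(portions)
    if count == 0:
        raise ValueError("at least one active UGR portion is required")
    scale = 1.0 / math.sqrt(count)
    # Sum the coefficients bin by bin, then apply the 1/sqrt(L) scaling.
    return [scale * sum(bins) for bins in zip(*portions)]
```

If the L inputs are independent with similar power, the sum has roughly L times that power, so the 1/√L factor keeps the output level approximately constant as portions enter or leave the active set.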

In addition to the above described time-domain and TF domain mixing techniques, the mixing step 350 may also be configured to utilize maximum component elimination mixing techniques. For example, assume a channel selection process that operates on a given dataset of M connected UGRs to return at each time and subband index, τ and o respectively, a set of active UGR indexes 𝒯(τ,o) such that the set 𝒯(τ,o) is not empty, and let L_(τ,o) be the positive integer denoting the cardinality of set 𝒯(τ,o), where 1≤L_(τ,o)≤M and there is a one-to-one correspondence between available UGR portions indexed with i and selected UGR portions indexed with l. As before, the notation X̂_(l)(τ,o), l=1, . . . , L_(τ,o), may be used to refer to the different components from the L_(τ,o) selected UGR portions.

The maximum component elimination mixing process described in this section may be implemented in the time subband domain through:

$\begin{matrix}{{S\left( {\tau,o} \right)} = {\sum\limits_{l = 1}^{L_{\tau,o}}{{w_{l}\left( {\tau,o} \right)}{{\hat{X}}_{l}\left( {\tau,o} \right)}}}} & (47)\end{matrix}$

with weights w_(l)(τ,o), l=1, . . . , L_(τ,o), defined as follows. At each time and subband index, the L_(τ,o) audio signal portions are ordered in descending order with respect to their energies. Then, the most energetic component is removed from the mix by assigning it a weight equal to zero. Mathematically, this can be formulated as:

$\begin{matrix}{{w_{l}\left( {\tau,o} \right)} = \left\{ {\begin{matrix}{0,} & {{{if}\mspace{14mu} {{{\hat{X}}_{l}\left( {\tau,o} \right)}}_{2}} > {{{{\hat{X}}_{k}\left( {\tau,o} \right)}}_{2}{\forall{k \neq l}}}} \\{\frac{1}{\sqrt{L_{\tau,o} - 1}},} & {otherwise}\end{matrix},} \right.} & (48)\end{matrix}$

where ∥⋅∥₂ denotes the Euclidean norm of a vector. An extension of the previous approach is to exploit prior information about the average signal power in each channel. Let P_(m)(o) denote the average power spectral density of the mth UGR in the oth subband, and assume that this quantity is precalculated for all m=1, . . . , M UGRs. In this case

$\frac{\left\| {{\hat{X}}_{m}\left( {\tau,o} \right)} \right\|_{2}^{2}}{P_{m}(o)}$

provides a normalized estimate of the local energy, and the weights are derived from:

$\begin{matrix}{{{\hat{w}}_{l}\left( {\tau,o} \right)} = \left\{ \begin{matrix}{0,} & {\text{if }\frac{\left\| {{\hat{X}}_{l}\left( {\tau,o} \right)} \right\|_{2}^{2}}{P_{l}(o)} > \frac{\left\| {{\hat{X}}_{k}\left( {\tau,o} \right)} \right\|_{2}^{2}}{P_{k}(o)}\;\forall k \neq l} \\{\frac{1}{\sqrt{L_{\tau,o} - 1}},} & \text{otherwise}\end{matrix} \right.} & (49)\end{matrix}$

The output signal may then be obtained from:

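$\begin{matrix}{{S\left( {\tau,o} \right)} = {\sum\limits_{l = 1}^{L_{\tau,o}}{{{\hat{w}}_{l}\left( {\tau,o} \right)}{{\hat{X}}_{l}\left( {\tau,o} \right)}}}} & (50)\end{matrix}$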

An extension of the maximum component elimination approaches presented above is the case where not only the single most energetic component but the Q=2, 3, . . . most energetic components are removed from the mix, in which case the constant value for all non-zero weights becomes

$\frac{1}{\sqrt{L_{\tau,o} - Q}}.$

Intuitively, this may allow more foreground energy to be removed from the mix, possibly at the cost of more audible artefacts.

Note that using S(τ,o), the final time-domain signal can be obtained using the inverse Fourier transform. The above-described mixing technique may also be applied to the case where each subband region consists of exactly one frequency bin.
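The maximum component elimination weights of Eqs. (48) and (49), together with the extension to the Q most energetic components, can be sketched for one time and subband index as shown below. Function and parameter names are illustrative. Two behaviours are assumptions not specified in the text: ties in energy are broken by input order, so exactly Q components are always removed, and the case of Q or fewer active portions is rejected and left to the caller.

```python
import math


def mce_weights(portions, q=1, priors=None):
    """Weights for maximum component elimination at one (tau, o) point.

    portions: list of L selected subband vectors (lists of complex values).
    q: number of most energetic components to remove (Q in the text).
    priors: optional average powers P_l(o); when given, energies are
            normalized by them as in Eq. (49).
    """
    count = len(portions)
    if not 0 < q < count:
        raise ValueError("need 1 <= q < number of active portions")
    # Squared Euclidean norm; ranking by it matches ranking by the norm.
    scores = [sum(abs(c) ** 2 for c in vec) for vec in portions]
    if priors is not None:
        scores = [s / p for s, p in zip(scores, priors)]
    ranked = sorted(range(count), key=lambda l: scores[l], reverse=True)
    removed = set(ranked[:q])
    keep = 1.0 / math.sqrt(count - q)
    return [0.0 if l in removed else keep for l in range(count)]


def mix_weighted(portions, weights):
    """Weighted sum over the selected portions, as in Eqs. (47) and (50)."""
    return [sum(w * c for w, c in zip(weights, bins)) for bins in zip(*portions)]
```

Calling `mix_weighted(portions, mce_weights(portions))` zeroes the single loudest portion, while passing `priors` instead removes the portion that is loudest relative to its own average power.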

The mixing step 350 may also be configured to mix UGRs based on a minimum variance mixing technique, which attempts to suppress interference components unique to each recording location and at the same time reveal the components which are common within the different UGRs. Assume a channel selection process that operates on a given dataset of M connected UGRs to return at each time and subband index, τ and o respectively, a set of active UGR indexes 𝒯(τ,o) such that the set 𝒯(τ,o) is not empty, and let L_(τ,o) be a positive integer denoting the cardinality of set 𝒯(τ,o), where 1≤L_(τ,o)≤M and there is a one-to-one correspondence between available UGRs indexed with i and selected UGRs indexed with l. As before, the notation X̂_(l)(τ,o), l=1, . . . , L_(τ,o), may be used to refer to the signal portion of the lth selected UGR.

Now let R(τ,o) be the L_(τ,o)×L_(τ,o) matrix with elements along the diagonal equal to:

R _(ll)(τ,o)=E{X̂ _(l) ^(H)(τ,o)X̂ _(l)(τ,o)}  (51)

and all other elements equal to zero, i.e., R_(ij)(τ,o)=0 for i≠j. Here E{⋅} denotes expectation. The minimum variance mixing (MVM) technique may be formulated as a special case of the minimum variance distortionless response (MVDR) beamformer. Traditionally, MVDR beamforming is implemented using a complex and fully populated covariance matrix R(τ,o). In the presented approach the covariance matrix is real and diagonal (the only non-zero elements are the diagonal ones). The following optimization problem may now be defined:

Minimize q(τ,o)^(T) R(τ,o)q(τ,o) subject to

𝟏^(T) q(τ,o)=1,  (52)

where vector q(τ,o)=[q_(1)(τ,o), . . . , q_(L) _(τ,o) (τ,o)]^(T) is related to the mixing weights through q_(l)(τ,o)=w_(l) ²(τ,o) and 𝟏 is the L_(τ,o)×1 vector with all ones. A solution to this optimization problem may be expressed using the formula:

$\begin{matrix}{{{q\left( {\tau,o} \right)} = \frac{\left\lbrack {{{R\left( {\tau,o} \right)} +} \in I} \right\rbrack^{- 1}}{{^{T}\left\lbrack {{{R\left( {\tau,o} \right)} +} \in I} \right\rbrack}^{- 1}}},} & (53)\end{matrix}$

where I is the L_(τ,o)×L_(τ,o) identity matrix and ϵ is a positive constant which can be defined by the user. The final weights may be calculated from w_(l)(τ,o)=√(q_(l)(τ,o)) and the mixture at each TF point may be derived from:

$\begin{matrix}{{S\left( {\tau,o} \right)} = {\sum\limits_{l = 1}^{L_{\tau,o}}{{w_{l}\left( {\tau,o} \right)}{{\hat{X}}_{l}\left( {\tau,o} \right)}}}} & (54)\end{matrix}$

Observe that the mixing weights are real and positive, i.e., w_(l)(τ,o)ϵℝ⁺, l=1, . . . , L_(τ,o).
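Because R(τ,o) is diagonal, the matrix inverse in Eq. (53) reduces to element-wise reciprocals of the diagonal entries. The following per-channel form is a short derivation sketch under that observation; it is not stated explicitly in the text above:

$\begin{matrix}{{q_{l}\left( {\tau,o} \right)} = \frac{\left( {R_{ll}\left( {\tau,o} \right) + \epsilon} \right)^{- 1}}{\sum\limits_{k = 1}^{L_{\tau,o}}\left( {R_{kk}\left( {\tau,o} \right) + \epsilon} \right)^{- 1}}},\quad {{w_{l}\left( {\tau,o} \right)} = \sqrt{q_{l}\left( {\tau,o} \right)}}\end{matrix}$

In this form each q_(l)(τ,o) is positive and the q_(l)(τ,o) sum to one, so the squared weights also sum to one, and a selected UGR portion with larger variance R_(ll)(τ,o) receives a smaller weight.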

Observe that the equality constraint 𝟏^(T)q(τ,o)=1 in Eq. (52) is formulated with the L_(τ,o)×1 all-ones vector. In a more general scheme, the technique can be implemented with any fixed real vector d, such that all elements of d are positive. In this case the constraint becomes d^(T)q(τ,o)=1.

The solution in the general case then reads:

$\begin{matrix}{{{q\left( {\tau,o} \right)} = \frac{\left\lbrack {{{R\left( {\tau,o} \right)} +} \in I} \right\rbrack^{- 1}d}{{d^{T}\left\lbrack {{{R\left( {\tau,o} \right)} +} \in I} \right\rbrack}^{- 1}d}},} & (55)\end{matrix}$

and the mixing weights are derived from w_(l)(τ,o)=√{square root over(q_(l)(τ,o))}.
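The weight computation of Eq. (55), which reduces to Eq. (53) when d is the all-ones vector, can be sketched as below. The sketch relies on R(τ,o) being diagonal, so only the diagonal entries are passed in. How the expectation in Eq. (51) is estimated is not specified in the text, so the variances are treated as given inputs; the function name and default ϵ value are illustrative assumptions.

```python
import math


def mvm_weights(variances, eps=1e-6, d=None):
    """Minimum variance mixing weights for one (tau, o) point.

    variances: diagonal entries R_ll(tau, o) of the covariance matrix.
    eps: positive regularization constant (epsilon in the text).
    d: optional constraint vector with positive entries; defaults to ones,
       which gives the solution of Eq. (53).
    Returns the weights w_l = sqrt(q_l).
    """
    count = len(variances)
    if d is None:
        d = [1.0] * count
    inverse = [1.0 / (r + eps) for r in variances]  # diagonal of (R + eps*I)^-1
    denom = sum(dk * dk * ik for dk, ik in zip(d, inverse))  # d^T (R+eps*I)^-1 d
    q = [dl * il / denom for dl, il in zip(d, inverse)]
    return [math.sqrt(ql) for ql in q]
```

For example, with variances 1 and 3 and a negligible ϵ, the quieter portion receives about three times the squared weight of the louder one, which reflects the intent of suppressing components that dominate a single recording.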

An alternative to the previous approach may be defined based on a different version of the covariance matrix that relies on prior information regarding the power spectral densities of the participating UGRs. More particularly, given an a-priori estimation of the energy at each subband index P_(i)(o), ∀i=1, . . . , M, the new covariance matrix R̂ is again a diagonal matrix defined through:

$\begin{matrix}{{{{\hat{R}}_{ll}\left( {\tau,o} \right)} = \frac{E\left\{ {{{\hat{X}}_{l}^{H}\left( {\tau,o} \right)}{{\hat{X}}_{l}\left( {\tau,o} \right)}} \right\}}{{\hat{P}}_{l}(o)}},} & (56)\end{matrix}$

where P̂_(l)(o) is equal to the prior of the l^(th) selected UGR in subband o. Depending on how the constraint is formulated, this will result in a new solution q̂(τ,o), calculated either in the context of Eq. (53) or in the context of Eq. (55). The final signal can in either case be calculated with use of Eq. (54) using:

w _(l)(τ,o)=√(q̂ _(l)(τ,o))  (57)

where q̂_(l)(τ,o) is the l^(th) element of q̂(τ,o).

Note that using S(τ,o), the final time-domain signal can be obtained using the inverse Fourier transform. The approaches presented above may also be extended to the case where each subband region consists of exactly one frequency bin.

As shown above, the steps of the method 300 may implement various techniques to facilitate mixing of UGRs. While each of the above-described steps of the method 300 has been demonstrated to provide certain technical improvements and advantages, it is to be understood that the steps of the method 300, and their associated technical improvements and advantages, are not dependent upon the specific manner in which other steps are implemented. For example, the various mixing techniques described above may be utilized by the mixing step 350 to mix UGRs without requiring execution of the normalization step 340; however, it is noted that where the normalization step 340 is also utilized, the quality of the mixed content resulting from the mixing step 350 may be improved. Thus, the method 300 provides new technological processes that improve the functioning of computer systems by enabling the automation of various tasks for mixing UGR content.

FIG. 6 shows a block diagram illustrating exemplary aspects of a mixing controller for mixing audio samples in accordance with embodiments of the present disclosure.

Mixing Controller

Referring now to FIG. 6, a block diagram illustrating exemplary aspects of a mixing controller for mixing audio samples in accordance with embodiments of the present disclosure is shown as mixing controller 601. In aspects, the mixing controller 601 may serve to mix UGR content generated by one or more users in accordance with aspects of the present disclosure, as described herein.

Typically, users, which may be people and/or other systems, may engage information technology systems (e.g., computers, mobile devices, such as tablet computing devices and/or smart phones, and the like) to capture multimedia content (e.g., audio content and/or audio and video content), referred to as UGRs, which may be obtained by the mixing controller 601. In aspects, the mixing controller 601 may obtain the UGRs from the users directly (e.g., the UGRs may be uploaded to the mixing controller 601). In additional or alternative aspects, the mixing controller 601 may obtain the UGRs from one or more external systems, such as social media platforms or other Internet-based platforms where users may publish/provide UGR data. The mixing controller 601 may then, in turn, employ processors to process the UGRs (e.g., to mix the UGRs and/or perform other operations in accordance with aspects of embodiments); such processors 603 may be referred to as central processing units (CPU). One form of processor is referred to as a microprocessor. CPUs use communicative circuits to pass binary encoded signals acting as instructions (e.g., instructions 630) to enable various operations. These instructions may be operational and/or data instructions containing and/or referencing other instructions and data in various processor accessible and operable areas of memory 629 (e.g., registers, cache memory, random access memory, etc.). Such communicative instructions may be stored and/or transmitted in batches (e.g., batches of instructions) as programs and/or data components to facilitate desired operations. These stored instruction codes (e.g., programs), may engage the CPU circuit components and other motherboard and/or system components to perform desired operations. One type of program is a computer operating system, which may be executed by a CPU on a computer. The operating system enables and facilitates users to access and operate computer information technology and resources. Some resources that may be employed in information technology systems include input and output mechanisms through which data may pass into and out of a computer, memory storage into which data may be saved, and processors by which information may be processed. These information technology systems may be used to collect data for later retrieval, analysis and manipulation, which may be facilitated through a database program. These information technology systems provide interfaces that allow users to access and operate various system components.

In one embodiment, the mixing controller 601 may be connected to and/or communicate with entities such as, but not limited to: one or more users from user input devices 611; peripheral devices 612; an optional cryptographic processor device 628; and/or a communications network 613.

Networks are commonly thought to comprise the interconnection and interoperation of clients, servers, and intermediary nodes in a graph topology. It should be noted that the term “server” as used throughout this application refers generally to a computer, other device, program, or combination thereof that processes and responds to the requests of remote users across a communications network. Servers serve their information to requesting “clients.” The term “client” as used herein refers generally to a computer, program, other device, user and/or combination thereof that is capable of processing and making requests and obtaining and processing any responses from servers across a communications network. A computer, other device, program, or combination thereof that facilitates, processes information and requests, and/or furthers the passage of information from a source user to a destination user is commonly referred to as a “node.” Networks are generally thought to facilitate the transfer of information from source points to destinations. A node specifically tasked with furthering the passage of information from a source to a destination is commonly called a “router.” There are many forms of networks such as Local Area Networks (LANs), Pico networks, Wide Area Networks (WANs), Wireless Networks (WLANs), etc. For example, the Internet is generally accepted as being an interconnection of a multitude of networks whereby remote clients and servers may access and interoperate with one another.

The mixing controller 601 may be based on computer systems that may comprise, but are not limited to, components such as: a computer systemization 602 connected to memory 629.

Computer Systemization

A computer systemization 602 may comprise a clock 630, central processing unit (“CPU(s)” and/or “processor(s)” (these terms are used interchangeably throughout the disclosure unless noted to the contrary)) 603, a memory 629 (e.g., a read only memory (ROM) 606, a random access memory (RAM) 605, etc.), and/or an interface bus 607, and most frequently, although not necessarily, are all interconnected and/or communicating through a system bus 604 on one or more (mother)board(s) 602 having conductive and/or otherwise transportive circuit pathways through which instructions (e.g., binary encoded signals) may travel to effectuate communications, operations, storage, etc. The computer systemization may be connected to a power source 686; e.g., optionally the power source may be internal. Optionally, a cryptographic processor 626 and/or transceivers (e.g., ICs) 674 may be connected to the system bus. In another embodiment, the cryptographic processor and/or transceivers may be connected as either internal and/or external peripheral devices 612 via the interface bus I/O. In turn, the transceivers may be connected to antenna(s) 675, thereby effectuating wireless transmission and reception of various communication and/or sensor protocols; for example the antenna(s) may connect to: a Texas Instruments WiLink WL1283 transceiver chip (e.g., providing 802.11n, Bluetooth 3.0, FM, Global Positioning System (GPS) (thereby allowing mixing controller 601 to determine its location)); a Broadcom BCM4329FKUBG transceiver chip (e.g., providing 802.11n, Bluetooth 2.1+EDR, FM, etc.); a Broadcom BCM4750IUB8 receiver chip (e.g., GPS); an Infineon Technologies X-Gold 618-PMB9800 (e.g., providing 2G/3G HSDPA/HSUPA communications); and/or the like. The system clock typically has a crystal oscillator and generates a base signal through the computer systemization's circuit pathways. The clock is typically coupled to the system bus and various clock multipliers that will increase or decrease the base operating frequency for other components interconnected in the computer systemization. The clock and various components in a computer systemization drive signals embodying information throughout the system. Such transmission and receipt of instructions embodying information throughout a computer systemization may be commonly referred to as communications. These communicative instructions may further be transmitted, received, and the cause of return and/or reply communications beyond the instant computer systemization to: communications networks, input devices, other computer systemizations, peripheral devices, and/or the like. It should be understood that in alternative embodiments, any of the above components may be connected directly to one another, connected to the CPU, and/or organized in numerous variations employed as exemplified by various computer systems.

The CPU comprises at least one high-speed data processor adequate to execute program components for executing user and/or system-generated requests. Often, the processors themselves will incorporate various specialized processing units, such as, but not limited to: integrated system (bus) controllers, memory management control units, floating point units, and even specialized processing sub-units like graphics processing units, digital signal processing units, and/or the like. Additionally, processors may include internal fast access addressable memory, and be capable of mapping and addressing memory 629 beyond the processor itself; internal memory may include, but is not limited to: fast registers, various levels of cache memory (e.g., level 1, 2, 3, etc.), RAM, etc. The processor may access this memory through the use of a memory address space that is accessible via instruction address, which the processor can construct and decode allowing it to access a circuit path to a specific memory address space having a memory state. The CPU may be a microprocessor such as: AMD's Athlon, Duron and/or Opteron; ARM's application, embedded and secure processors; IBM and/or Motorola's DragonBall and PowerPC; IBM's and Sony's Cell processor; Intel's Celeron, Core (2) Duo, Itanium, Pentium, Xeon, and/or XScale; and/or the like processor(s). The CPU interacts with memory through instruction passing through conductive and/or transportive conduits (e.g., (printed) electronic and/or optic circuits) to execute stored instructions (i.e., program code) according to conventional data processing techniques. Such instruction passing facilitates communication within the mixing controller and beyond through various interfaces. Should processing requirements dictate a greater amount of speed and/or capacity, distributed processors (e.g., a distributed mixing controller), mainframe, multi-core, parallel, and/or super-computer architectures may similarly be employed. Alternatively, should deployment requirements dictate greater portability, smaller Personal Digital Assistants (PDAs), laptop computing devices, or other portable devices configured in accordance with embodiments of the present disclosure may be employed.

Depending on the particular implementation, features of the mixing controller may be achieved by implementing a microcontroller such as CAST's R8051XC2 microcontroller; Intel's MCS 51 (i.e., 8051 microcontroller); and/or the like. Also, to implement certain features of the mixing controller, some feature implementations may rely on embedded components, such as: Application-Specific Integrated Circuit (“ASIC”), Digital Signal Processing (“DSP”), Field Programmable Gate Array (“FPGA”), and/or the like embedded technology. For example, any of the mixing controller component collection (distributed or otherwise) and/or features may be implemented via the microprocessor and/or via embedded components; e.g., via ASIC, coprocessor, DSP, FPGA, and/or the like. Alternately, some implementations of the mixing controller may be implemented with embedded components that are configured and used to achieve a variety of features or signal processing.

Depending on the particular implementation, the embedded components may include software solutions, hardware solutions, and/or some combination of both hardware/software solutions. For example, mixing controller features discussed herein may be achieved through implementing FPGAs, which are semiconductor devices containing programmable logic components called “logic blocks,” and programmable interconnects, such as the high performance FPGA Virtex series and/or the low cost Spartan series manufactured by Xilinx. Logic blocks and interconnects can be programmed by the customer or designer, after the FPGA is manufactured, to implement any of the mixing controller features. A hierarchy of programmable interconnects allows logic blocks to be interconnected as needed by the mixing controller system designer/administrator, somewhat like a one-chip programmable breadboard. An FPGA's logic blocks can be programmed to perform the operation of basic logic gates such as AND and XOR, or more complex combinational operators such as decoders or mathematical operations. In most FPGAs, the logic blocks also include memory elements, which may be circuit flip-flops or more complete blocks of memory. In some circumstances, the mixing controller may be developed on regular FPGAs and then migrated into a fixed version that more resembles ASIC implementations. Alternate or coordinating implementations may migrate mixing controller features to a final ASIC instead of or in addition to FPGAs. Depending on the implementation, all of the aforementioned embedded components and microprocessors may be considered the “CPU” and/or “processor” for the mixing controller.

Power Source

The power source 686 may be of any standard form for powering small electronic circuit board devices such as the following power cells: alkaline, lithium hydride, lithium ion, lithium polymer, nickel cadmium, solar cells, and/or the like. Other types of AC or DC power sources may be used as well. In the case of solar cells, in one embodiment, the case provides an aperture through which the solar cell may capture photonic energy. The power cell 686 is connected to at least one of the interconnected subsequent components of the mixing controller thereby providing an electric current to all subsequent components. In one example, the power source 686 is connected to the system bus component 604. In an alternative embodiment, an outside power source 686 is provided through a connection across the I/O 608 interface. For example, a USB and/or IEEE 1394 connection carries both data and power across the connection and is therefore a suitable source of power.

Interface Adapters

Interface bus(es) 607 may accept, connect, and/or communicate to a number of interface adapters, conventionally although not necessarily in the form of adapter cards, such as but not limited to: input output interfaces (I/O) 608, storage interfaces 609, network interfaces 610, and/or the like. Optionally, cryptographic processor interfaces 627 similarly may be connected to the interface bus. The interface bus provides for the communications of interface adapters with one another as well as with other components of the computer systemization. Interface adapters are adapted for a compatible interface bus. Interface adapters conventionally connect to the interface bus via a slot architecture. Conventional slot architectures may be employed, such as, but not limited to: Accelerated Graphics Port (AGP), Card Bus, (Extended) Industry Standard Architecture ((E)ISA), Micro Channel Architecture (MCA), NuBus, Peripheral Component Interconnect (Extended) (PCI(X)), PCI Express, Personal Computer Memory Card International Association (PCMCIA), and/or the like.

Storage interfaces 609 may accept, communicate, and/or connect to a number of storage devices such as, but not limited to: (Ultra) (Serial) Advanced Technology Attachment (Packet Interface) ((Ultra) (Serial) ATA(PI)), (Enhanced) Integrated Drive Electronics ((E)IDE), Institute of Electrical and Electronics Engineers (IEEE) 1394, fiber channel, Small Computer Systems Interface (SCSI), Universal Serial Bus (USB), and/or the like.

Network interfaces 610 may accept, communicate, and/or connect to a communications network 613. Through a communications network 613, the mixing controller is accessible through remote clients 633 b (e.g., computers and other electronic devices capable of generating and/or communicating UGR content to the mixing controller via a local or network-based connection) by users 633 a. Network interfaces may employ connection protocols such as, but not limited to: direct connect, Ethernet (thick, thin, twisted pair 10/100/1000 Base T, and/or the like), Token Ring, wireless connection such as IEEE 802.11a-x, and/or the like. Should processing requirements dictate a greater amount of speed and/or capacity, distributed network controllers (e.g., distributed mixing controller), architectures may similarly be employed to pool, load balance, and/or otherwise increase the communicative bandwidth required by the mixing controller. A communications network may be any one and/or the combination of the following: a direct interconnection, the Internet, a Local Area Network (LAN), a Metropolitan Area Network (MAN), an Operating Missions as Nodes on the Internet (OMNI), a secured custom connection, a Wide Area Network (WAN), a wireless network (e.g., employing protocols such as, but not limited to a Wireless Application Protocol (WAP), I-mode, and/or the like), and/or the like. A network interface may be regarded as a specialized form of an input output interface. Further, multiple network interfaces 610 may be used to engage with various communications network types 613. For example, multiple network interfaces may be employed to allow for the communication over broadcast, multicast, and/or unicast networks.

Input Output interfaces (I/O) 608 may accept, communicate, and/or connect to user input devices 611, peripheral devices 612, cryptographic processor devices 628, and/or the like. I/O may employ connection protocols such as, but not limited to: audio: analog, digital, monaural, RCA, stereo, and/or the like; data: Apple Desktop Bus (ADB), IEEE 1394a-b, serial, universal serial bus (USB), infrared, joystick, keyboard, midi, optical, PC AT, PS/2, parallel, radio; video interface: Apple Desktop Connector (ADC), BNC, coaxial, component, composite, digital, Digital Visual Interface (DVI), high-definition multimedia interface (HDMI), RCA, RF antennae, S-Video, VGA, and/or the like; wireless transceivers: 802.11a/b/g/n/x; Bluetooth, cellular (e.g., code division multiple access (CDMA), high speed packet access (HSPA(+)), high speed downlink packet access (HSDPA), global system for mobile communications (GSM), long term evolution (LTE), WiMax, etc.), and/or the like. One typical output device may include a video display, which typically comprises a Cathode Ray Tube (CRT) or Liquid Crystal Display (LCD) based monitor with an interface (e.g., DVI circuitry and cable) that accepts signals from a video interface. The video interface composites information generated by a computer systemization and generates video signals based on the composited information in a video memory frame. Another output device is a television set, which accepts signals from a video interface. Typically, the video interface provides the composited video information through a video connection interface that accepts a video display interface (e.g., an RCA composite video connector accepting an RCA composite video cable; a DVI connector accepting a DVI display cable, etc.).

User input devices 611 often are a type of peripheral device 612 (see below) and may include: card readers, dongles, fingerprint readers, gloves, graphics tablets, joysticks, keyboards, microphones, mouse (mice), remote controls, retina readers, touch screens (e.g., capacitive, resistive, etc.), trackballs, trackpads, sensors (e.g., accelerometers, ambient light, GPS, gyroscopes, proximity, etc.), styluses, and/or the like.

Peripheral devices 612 may be connected and/or communicate to I/O and/or other facilities of the like such as network interfaces, storage interfaces, directly to the interface bus, system bus, the CPU, and/or the like. Peripheral devices may be external, internal and/or part of the mixing controller. Peripheral devices may include: antenna, audio devices (e.g., line-in, line-out, microphone input, speakers, etc.), cameras (e.g., still, video, webcam, etc.), dongles (e.g., for copy protection, ensuring secure transactions with a digital signature, and/or the like), external processors (for added capabilities; e.g., crypto devices 628), force-feedback devices (e.g., vibrating motors), network interfaces, printers, scanners, storage devices, transceivers (e.g., cellular, GPS, etc.), video devices (e.g., goggles, monitors, etc.), video sources, visors, and/or the like. Peripheral devices often include types of input devices (e.g., cameras).

It should be noted that although user input devices and peripheral devices may be employed, the mixing controller may be embodied as an embedded, dedicated, and/or monitor-less (i.e., headless) device, wherein access would be provided over a network interface connection.

Cryptographic units such as, but not limited to, microcontrollers, processors 626, interfaces 627, and/or devices 628 may be attached, and/or communicate with the mixing controller. A MC68HC16 microcontroller, manufactured by Motorola Inc., may be used for and/or within cryptographic units. The MC68HC16 microcontroller utilizes a 16-bit multiply-and-accumulate instruction in the 16 MHz configuration and requires less than one second to perform a 512-bit RSA private key operation. Cryptographic units support the authentication of communications from interacting agents, as well as allowing for anonymous transactions. Cryptographic units may also be configured as part of the CPU. Equivalent microcontrollers and/or processors may also be used. Other commercially available specialized cryptographic processors include: Broadcom's CryptoNetX and other Security Processors, nCipher's nShield, SafeNet's Luna PCI (e.g., 7100) series, Semaphore Communications' 40 MHz Roadrunner 184, Sun's Cryptographic Accelerators (e.g., Accelerator 6000 PCIe Board, Accelerator 500 Daughtercard), Via Nano Processor (e.g., L2100, L2200, U2400) line, which is capable of performing 500+ MB/s of cryptographic instructions; VLSI Technology's 33 MHz 6868, and/or the like.

Memory

Generally, any mechanization and/or embodiment allowing a processor to effect the storage and/or retrieval of information is regarded as memory 629. However, memory is a fungible technology and resource; thus, any number of memory embodiments may be employed in lieu of or in concert with one another. It is to be understood that the mixing controller and/or a computer systemization may employ various forms of memory 629. For example, a computer systemization may be configured where the operation of on-chip CPU memory (e.g., registers), RAM, ROM, and any other storage devices are provided by a paper punch tape or paper punch card mechanism; however, such an embodiment would result in an extremely slow rate of operation. In a typical configuration, memory 629 will include ROM 606, RAM 605, and a storage device 614. A storage device 614 may be a conventional computer system storage. Storage devices may include a drum, a (fixed and/or removable) magnetic disk drive, a magneto-optical drive, an optical drive (i.e., Blu-ray, CD ROM/RAM/Recordable (R)/ReWritable (RW), DVD R/RW, HD DVD R/RW, etc.), an array of devices (e.g., Redundant Array of Independent Disks (RAID)), solid state memory devices (USB memory, solid state drives (SSD), etc.), other processor-readable storage mediums, and/or other devices of the like. Thus, a computer systemization generally requires and makes use of memory.

Component Collection

The memory 629 may contain a collection of program and/or database components and/or data such as, but not limited to: operating system component(s) 615 (operating system), information server component(s) (information server), user interface component(s) (user interface), Web browser component(s) (Web browser), UGR database(s) 619, pairing module 620, clustering module 622, synchronization module 624, normalization module 626, mixing module 628, the mixing component(s) 635, and/or the like (i.e., collectively a component collection). In aspects, the pairing module 620 may be configured to perform operations corresponding to the step 110 of FIG. 1 and/or steps 310 and 320 of FIG. 3, as described above; the clustering module 622 may be configured to perform operations corresponding to the step 120 of FIG. 1 and/or step 330 of FIG. 3, as described above; the synchronization module 624 may be configured to perform operations corresponding to the step 130 of FIG. 1 and/or step 330 (e.g., the synchronization portion) of FIG. 3, as described above; the normalization module 626 may be configured to perform operations corresponding to the step 140 of FIG. 1 and/or step 340 of FIG. 3, as described above; and the mixing module 628 may be configured to perform operations corresponding to the step 150 of FIG. 1 and/or the step 350 of FIG. 3, as described above. These components may be stored and accessed from the storage devices and/or from storage devices accessible through an interface bus. Although non-conventional program components such as those in the component collection, typically, are stored in a local storage device 614, they may also be loaded and/or stored in memory such as peripheral devices, RAM, remote storage facilities through a communications network, ROM, various forms of memory, and/or the like.

Operating System

The operating system component 615 is an executable program component facilitating the operation of the mixing controller. Typically, the operating system facilitates access of I/O, network interfaces, peripheral devices, storage devices, and/or the like. The operating system may be a highly fault tolerant, scalable, and secure system such as: Apple Macintosh OS X (Server), AT&T Plan 9, Be OS, Unix and Unix-like system distributions (such as AT&T's UNIX, Berkeley Software Distribution (BSD) variations such as FreeBSD, NetBSD, OpenBSD, and/or the like, Linux distributions such as Red Hat, Ubuntu, and/or the like), and/or the like operating systems. However, more limited and/or less secure operating systems may also be employed such as Apple Macintosh OS, IBM OS/2, Microsoft DOS, Microsoft Windows 2000/2003/3.1/95/98/CE/Millennium/NT/Vista/XP (Server), Palm OS, and/or the like. An operating system may communicate to and/or with other components in a component collection, including itself, and/or the like. Most frequently, the operating system communicates with other program components, user interfaces, and/or the like. For example, the operating system may contain, communicate, generate, obtain, and/or provide program component, system, user, and/or data communications, requests, and/or responses. The operating system, once executed by the CPU, may enable the interaction with communications networks, data, I/O, peripheral devices, program components, memory, user input devices, and/or the like. The operating system may provide communications protocols that allow the mixing controller to communicate with other entities through a communications network 613. Various communication protocols may be used by the mixing controller as a subcarrier transport mechanism for interaction, such as, but not limited to: multicast, TCP/IP, UDP, unicast, and/or the like.

Information Server

An information server component 616 is a stored program component that is executed by a CPU. The information server may be a conventional Internet information server such as, but not limited to Apache Software Foundation's Apache, Microsoft's Internet Information Server, and/or the like. The information server may allow for the execution of program components through facilities such as Active Server Page (ASP), ActiveX, (ANSI) (Objective-) C (++), C# and/or .NET, Common Gateway Interface (CGI) scripts, dynamic (D) hypertext markup language (HTML), FLASH, Java, JavaScript, Practical Extraction Report Language (PERL), Hypertext Pre-Processor (PHP), pipes, Python, wireless application protocol (WAP), WebObjects, and/or the like. The information server may support secure communications protocols such as, but not limited to, File Transfer Protocol (FTP), Hypertext Transfer Protocol (HTTP), Secure Hypertext Transfer Protocol (HTTPS), Secure Socket Layer (SSL), messaging protocols (e.g., America Online (AOL) Instant Messenger (AIM), Application Exchange (APEX), ICQ, Internet Relay Chat (IRC), Microsoft Network (MSN) Messenger Service, Presence and Instant Messaging Protocol (PRIM), Internet Engineering Task Force's (IETF's) Session Initiation Protocol (SIP), SIP for Instant Messaging and Presence Leveraging Extensions (SIMPLE), open XML-based Extensible Messaging and Presence Protocol (XMPP) (i.e., Jabber or Open Mobile Alliance's (OMA's) Instant Messaging and Presence Service (IMPS)), Yahoo! Instant Messenger Service), and/or the like. The information server provides results in the form of Web pages to Web browsers, and allows for the manipulated generation of the Web pages through interaction with other program components. After a Domain Name System (DNS) resolution portion of an HTTP request is resolved to a particular information server, the information server resolves requests for information at specified locations on the mixing controller based on the remainder of the HTTP request.
For example, a request such as http://123.124.125.126/myInformation.html might have the IP portion of the request “123.124.125.126” resolved by a DNS server to an information server at that IP address; that information server might in turn further parse the http request for the “/myInformation.html” portion of the request and resolve it to a location in memory containing the information “myInformation.html.” Additionally, other information serving protocols may be employed across various ports, e.g., FTP communications across port 21, and/or the like. An information server may communicate to and/or with other components in a component collection, including itself, and/or facilities of the like. Most frequently, the information server communicates with the UGR database 619, operating systems, other program components, user interfaces, Web browsers, and/or the like.
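As a minimal sketch of the request handling described above, the following Python fragment separates the example request into its host portion and its path portion using the standard library. The function name split_request is hypothetical and is used for illustration only; it is not part of the mixing controller.

```python
from urllib.parse import urlsplit


def split_request(url):
    """Split a request URL into (host portion, path portion)."""
    parts = urlsplit(url)
    return parts.hostname, parts.path


# The host portion selects the information server; the path portion
# is what that server resolves to a stored resource.
host, path = split_request("http://123.124.125.126/myInformation.html")
print(host)
print(path)
```

In this sketch the host portion corresponds to the part of the request that identifies the information server, and the path portion corresponds to the remainder that the information server resolves locally.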

Access to the UGR database may be achieved through a number of database bridge mechanisms such as through scripting languages as enumerated below (e.g., CGI) and through inter-application communication channels as enumerated below (e.g., CORBA, WebObjects, etc.). Any data requests through a Web browser are parsed through the bridge mechanism into appropriate grammars as required by the mixing controller. In one embodiment, the information server would provide a Web form accessible by a Web browser. Entries made into supplied fields in the Web form are tagged as having been entered into the particular fields, and parsed as such. The entered terms are then passed along with the field tags, which act to instruct the parser to generate queries directed to appropriate tables and/or fields. In one embodiment, the parser may generate queries in standard SQL by instantiating a search string with the proper join/select commands based on the tagged text entries, wherein the resulting command is provided over the bridge mechanism to the mixing controller as a query. Upon generating query results from the query, the results are passed over the bridge mechanism, and may be parsed for formatting and generation of a new results Web page by the bridge mechanism. Such a new results Web page is then provided to the information server, which may supply it to the requesting Web browser.
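The mapping from tagged form entries to an SQL query may be sketched as follows. This is an illustrative example only: the table name ugr, its columns, and the field tags "event" and "uploader" are assumptions introduced here and do not appear in the disclosure. The entered terms are passed as bound parameters rather than being spliced into the query text.

```python
import sqlite3

# Hypothetical mapping from Web form field tags to table columns.
ALLOWED_FIELDS = {"event": "event_name", "uploader": "uploader_name"}


def build_query(tagged_entries):
    """Turn {field tag: entered term} into (SQL text, parameter list)."""
    clauses, params = [], []
    for tag, term in tagged_entries.items():
        column = ALLOWED_FIELDS[tag]  # raises KeyError for unknown tags
        clauses.append(f"{column} = ?")
        params.append(term)
    sql = "SELECT id, event_name, uploader_name FROM ugr"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return sql, params


# In-memory database standing in for the UGR database (assumed schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ugr (id INTEGER PRIMARY KEY, event_name TEXT, uploader_name TEXT)"
)
conn.executemany(
    "INSERT INTO ugr (event_name, uploader_name) VALUES (?, ?)",
    [("concert", "alice"), ("concert", "bob"), ("match", "alice")],
)

sql, params = build_query({"event": "concert", "uploader": "alice"})
rows = conn.execute(sql, params).fetchall()
print(sql)
print(rows)
```

Restricting the field tags to a fixed mapping is one possible design choice; it keeps column names out of user control while the entered terms remain ordinary query parameters.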

Also, an information server may contain, communicate, generate, obtain, and/or provide component, system, user, and/or data communications, requests, and/or responses.

User Interface

Computer interfaces in some respects are similar to automobile operation interfaces. Automobile operation interface elements such as steering wheels, gearshifts, and speedometers facilitate the access, operation, and display of automobile resources, and status. Computer interaction interface elements such as check boxes, cursors, menus, scrollers, and windows (collectively and commonly referred to as widgets) similarly facilitate the access, capabilities, operation, and display of data and computer hardware and operating system resources, and status. Operation interfaces are commonly called user interfaces. Graphical user interfaces (GUIs) such as the Apple Macintosh Operating System's Aqua, IBM's OS/2, Microsoft's Windows 2000/2003/3.1/95/98/CE/Millennium/NT/XP/Vista/7 (i.e., Aero), Unix's X-Windows (e.g., which may include additional Unix graphic interface libraries and layers such as K Desktop Environment (KDE), mythTV and GNU Network Object Model Environment (GNOME)), and web interface libraries (e.g., ActiveX, AJAX, (D)HTML, FLASH, Java, JavaScript, etc. interface libraries such as, but not limited to, Dojo, jQuery(UI), MooTools, Prototype, script.aculo.us, SWFObject, Yahoo! User Interface, any of which may be used) provide a baseline and means of accessing and displaying information graphically to users.

A user interface component may be a stored program component that is executed by a CPU. The user interface may be a conventional graphic user interface as provided by, with, and/or atop operating systems and/or operating environments such as already discussed. The user interface may allow for the display, execution, interaction, manipulation, and/or operation of program components and/or system facilities through textual and/or graphical facilities. The user interface provides a facility through which users may affect, interact, and/or operate a computer system. A user interface may communicate to and/or with other components in a component collection, including itself, and/or facilities of the like. Most frequently, the user interface communicates with operating systems, other program components, and/or the like. The user interface may contain, communicate, generate, obtain, and/or provide program component, system, user, and/or data communications, requests, and/or responses.

Web Browser

A Web browser component 318 is a stored program component that is executed by a CPU. The Web browser may be a conventional hypertext viewing application such as Microsoft Internet Explorer or Netscape Navigator. Secure Web browsing may be supplied with 128 bit (or greater) encryption by way of HTTPS, SSL, and/or the like. Web browsers allow for the execution of program components through facilities such as ActiveX, AJAX, (D)HTML, FLASH, Java, JavaScript, web browser plug-in APIs (e.g., FireFox, Safari Plug-in, and/or the like APIs), and/or the like. Web browsers and like information access tools may be integrated into PDAs, cellular telephones, and/or other mobile devices. A Web browser may communicate to and/or with other components in a component collection, including itself, and/or facilities of the like. Most frequently, the Web browser communicates with information servers, operating systems, integrated program components (e.g., plug-ins), and/or the like; e.g., it may contain, communicate, generate, obtain, and/or provide program component, system, user, and/or data communications, requests, and/or responses. Also, in place of a Web browser and information server, a combined application may be developed to perform similar operations of both. The combined application would similarly effect the obtaining and the provision of information to users, user agents, and/or the like from the mixing controller enabled nodes. The combined application may be nugatory on systems employing standard Web browsers.

The UGR Database

The UGR database component 619 may be embodied in a database and its stored data. The database is a stored program component, which is executed by the CPU; the stored program component portion configuring the CPU to process the stored data. The database may be a conventional, fault tolerant, relational, scalable, secure database such as Oracle or Sybase. Relational databases are an extension of a flat file. Relational databases consist of a series of related tables. The tables are interconnected via a key field. Use of the key field allows the combination of the tables by indexing against the key field; i.e., the key fields act as dimensional pivot points for combining information from various tables. Relationships generally identify links maintained between tables by matching primary keys. Primary keys represent fields that uniquely identify the rows of a table in a relational database. More precisely, they uniquely identify rows of a table on the “one” side of a one-to-many relationship.
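By way of illustration, the following sketch combines two related tables by matching a key field, with the primary key on the “one” side of a one-to-many relationship. The table names users and ugrs, their columns, and the sample rows are hypothetical and are chosen only to make the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- "One" side: each row is uniquely identified by its primary key.
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, name TEXT);
    -- "Many" side: several recordings may refer to the same user.
    CREATE TABLE ugrs (
        ugr_id  INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        title   TEXT
    );
    INSERT INTO users VALUES (1, 'alice'), (2, 'bob');
    INSERT INTO ugrs VALUES (10, 1, 'clip-a'), (11, 1, 'clip-b'), (12, 2, 'clip-c');
    """
)

# The key field user_id acts as the pivot for combining the two tables.
rows = conn.execute(
    "SELECT users.name, ugrs.title "
    "FROM users JOIN ugrs ON ugrs.user_id = users.user_id "
    "ORDER BY ugrs.ugr_id"
).fetchall()
print(rows)
```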

Alternatively, the UGR database may be implemented using various standard data-structures, such as an array, hash, (linked) list, struct, structured text file (e.g., XML), table, and/or the like. Such data-structures may be stored in memory and/or in (structured) files. In another alternative, an object-oriented database may be used, such as Frontier, ObjectStore, Poet, Zope, and/or the like. Object databases can include a number of object collections that are grouped and/or linked together by common attributes; they may be related to other object collections by some common attributes. Object-oriented databases perform similarly to relational databases with the exception that objects are not just pieces of data but may have other types of capabilities encapsulated within a given object. If the UGR database is implemented as a data-structure, the use of the UGR database 619 may be integrated into another component such as the mixing component 635. Also, the database may be implemented as a mix of data structures, objects, and relational structures. Databases may be consolidated and/or distributed in countless variations through standard data processing techniques. Portions of databases, e.g., tables, may be exported and/or imported and thus decentralized and/or integrated.

In one embodiment, the UGR database component 619 includes UGR data 619 a-n. In one embodiment, the UGR data 619 a-n includes UGR content received from, obtained from, or generated by n users, and/or the like.

In one embodiment, the UGR database may interact with other database systems. For example, when employing a distributed database system, queries and data access by the mixing component may treat the combination of the UGR database and an integrated data security layer database as a single database entity.

In one embodiment, user programs may contain various user interface primitives, which may serve to update the mixing controller. Also, various accounts may require custom database tables depending upon the environments and the types of clients the mixing controller may need to serve. It should be noted that any unique fields may be designated as a key field throughout. In an alternative embodiment, these tables have been decentralized into their own databases and their respective database controllers (i.e., individual database controllers for each of the above tables). Employing standard data processing techniques, one may further distribute the databases over several computer systemizations and/or storage devices. Similarly, configurations of the decentralized database controllers may be varied by consolidating and/or distributing the various database components 619 a-n. The mixing controller may be configured to keep track of various settings, inputs, and parameters via database controllers.

The UGR database may communicate to and/or with other components in a component collection, including itself, and/or facilities of the like. Most frequently, the UGR database communicates with the mixing component, other program components, and/or the like. The database may contain, retain, and provide information regarding other nodes and data.

The Mixing Component

The mixing component 635 is a stored program component that is executed by a CPU. In one embodiment, the mixing component incorporates any and/or all combinations of the aspects of the mixing controller that were discussed in the previous figures. As such, the mixing controller affects accessing, obtaining and the provision of information, services, transactions, and/or the like across various communications networks. The features and embodiments of the mixing component discussed herein increase network efficiency by reducing data transfer requirements and by using more efficient data structures and mechanisms for their transfer and storage. As a consequence, more data may be transferred in less time, and latencies with regard to transactions are also reduced. In many cases, such reduction in storage, transfer time, bandwidth requirement, latencies, etc., will reduce the capacity and structural infrastructure requirements to support the mixing controller's features and facilities, and in many cases reduce the costs and energy consumption/requirements, and extend the life of the mixing controller's underlying infrastructure; this has the added benefit of making the mixing controller more reliable. Similarly, many of the features and mechanisms are designed to be easier for users to use and access, thereby broadening the audience that may enjoy/employ and exploit the feature sets of the mixing controller; such ease of use also helps to increase the reliability of the mixing controller.

The mixing component enabling access of information between nodes may be developed by employing standard development tools and languages such as, but not limited to: Apache components, Assembly, ActiveX, binary executables, (ANSI) (Objective-) C (++), C# and/or .NET, database adapters, CGI scripts, Java, JavaScript, mapping tools, procedural and object oriented development tools, PERL, PHP, Python, shell scripts, SQL commands, web application server extension, web development environments and libraries (e.g., Microsoft's ActiveX, Adobe AIR, FLEX & FLASH, AJAX, (D)HTML, Dojo, Java, JavaScript, jQuery(UI), MooTools, Prototype, script.aculo.us, Simple Object Access Protocol (SOAP), SWFObject, Yahoo! User Interface, and/or the like), WebObjects, and/or the like. In one embodiment, the mixing controller server employs a cryptographic server to encrypt and decrypt communications. The mixing component may communicate to and/or with other components in a component collection, including itself, and/or facilities of the like. Most frequently, the mixing component communicates with the UGR database, operating systems, other program components, and/or the like. The mixing controller may contain, communicate, generate, obtain, and/or provide program component, system, user, and/or data communications, requests, and/or responses.

Distributed Mixing Controllers

The structure and/or operation of any of the mixing controller components may be combined, consolidated, and/or distributed in any number of ways to facilitate development and/or deployment. Similarly, the component collection may be combined in any number of ways to facilitate deployment and/or development. To accomplish this, one may integrate the components into a common code base or in a facility that can dynamically load the components on demand in an integrated fashion.

The component collection may be consolidated and/or distributed in countless variations through standard data processing and/or development techniques. Multiple instances of any one of the program components in the program component collection may be instantiated on a single node, and/or across numerous nodes to improve performance through load-balancing and/or data-processing techniques. Furthermore, single instances may also be distributed across multiple controllers and/or storage devices, e.g., databases. All program component instances and controllers working in concert may do so through standard data processing communication techniques.

The configuration of the mixing controller will depend on the context of system deployment. Factors such as, but not limited to, the budget, capacity, location, and/or use of the underlying hardware resources may affect deployment requirements and configuration. Regardless of whether the configuration results in more consolidated and/or integrated program components, results in a more distributed series of program components, and/or results in some combination between a consolidated and distributed configuration, data may be communicated, obtained, and/or provided. Instances of components consolidated into a common code base from the program component collection may communicate, obtain, and/or provide data. This may be accomplished through intra-application data processing communication techniques such as, but not limited to: data referencing (e.g., pointers), internal messaging, object instance variable communication, shared memory space, variable passing, and/or the like.

If component collection components are discrete, separate, and/or external to one another, then communicating, obtaining, and/or providing data with and/or to other components may be accomplished through inter-application data processing communication techniques such as, but not limited to: Application Program Interfaces (API) information passage, (distributed) Component Object Model ((D)COM), (Distributed) Object Linking and Embedding ((D)OLE), and/or the like, Common Object Request Broker Architecture (CORBA), Jini local and remote application program interfaces, JavaScript Object Notation (JSON), Remote Method Invocation (RMI), SOAP, process pipes, shared files, and/or the like. Messages sent between discrete components for inter-application communication may be facilitated through the creation and parsing of a grammar. A grammar may be developed by using development tools such as lex, yacc, XML, and/or the like, which allow for grammar generation and parsing capabilities, which in turn may form the basis of communication messages within and between components.

For example, a grammar may be arranged to recognize the tokens of an HTTP post command, e.g.:

w3c -post http://... Value1

where Value1 is discerned as being a parameter because “http://” is part of the grammar syntax, and what follows is considered part of the post value. Similarly, with such a grammar, a variable “Value1” may be inserted into an “http://” post command and then sent. The grammar syntax itself may be presented as structured data that is interpreted and/or otherwise used to generate the parsing mechanism (e.g., a syntax description text file as processed by lex, yacc, etc.). Also, once the parsing mechanism is generated and/or instantiated, it itself may process and/or parse structured data such as, but not limited to: character (e.g., tab) delineated text, HTML, structured text streams, XML, and/or the like structured data. In another embodiment, inter-application data processing protocols themselves may have integrated and/or readily available parsers (e.g., JSON, SOAP, and/or like parsers) that may be employed to parse (e.g., communications) data. Further, the parsing grammar is not limited to message parsing, and may also be used to parse: databases, data collections, data stores, structured data, and/or the like. Again, the desired configuration will depend upon the context, environment, and requirements of system deployment.
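A very small grammar of this kind may be sketched with a regular expression in place of a lex/yacc-generated parser. The sketch assumes that a command consists of the literal tokens w3c and -post, followed by a URL beginning with “http://”, followed by the post value; the function names parse_post and build_post and the address http://example.com/submit are hypothetical and serve only as illustration.

```python
import re

# Assumed token layout: "w3c", "-post", a URL starting with "http://",
# then everything that follows as the post value.
POST_COMMAND = re.compile(
    r"^w3c\s+-post\s+(?P<url>http://\S+)\s+(?P<value>.+)$"
)


def parse_post(command):
    """Recognize the tokens of a post command and return its parts."""
    match = POST_COMMAND.match(command)
    if match is None:
        raise ValueError(f"not a recognized post command: {command!r}")
    return {"url": match.group("url"), "value": match.group("value")}


def build_post(url, value):
    """Insert a value into a post command using the same layout."""
    return f"w3c -post {url} {value}"


message = build_post("http://example.com/submit", "Value1")
print(parse_post(message))
```

Because build_post and parse_post share one token layout, a message produced by one component can be parsed back into the same parts by another, which is the property the grammar is meant to provide.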

Referring to FIG. 7, a flow diagram of a method for mixing media content based on a plurality of UGRs in accordance with embodiments of the present disclosure is shown as a method 700. In an aspect, the steps of the method 700 may be performed by the system 600 of FIG. 6. For example, the method 700 may be stored as instructions (the instructions 630 of FIG. 6) that, when executed by one or more processors (e.g., the one or more processors 603 of FIG. 6), cause the one or more processors to perform operations for mixing media content based on a plurality of UGRs in accordance with the techniques described above with respect to FIGS. 1-5.

As shown in FIG. 7, the method 700 may include, at step 710, receiving a plurality of UGRs. Each UGR of the plurality of UGRs may include at least audio content. One or more of the UGRs may also include additional types of content, such as visual content (e.g., image content, video content, and the like). As described above, the plurality of UGRs may be associated with audio content captured during an event, such as a concert, a sporting event, or other type of event. In an aspect, the plurality of UGRs may be received from one or more users. For example, the system 600 of FIG. 6 may be configured to receive UGRs from a plurality of users and to generate new audio content from the received UGRs using the method 700, as described in more detail below.

At step 720, the method 700 includes determining a correlation between samples of audio content associated with at least two UGRs of the plurality of UGRs. In an aspect, the step 720 may include operations similar to the step 110 of FIG. 1 and/or the steps 310 and 320 of FIG. 3. At step 730, the method 700 includes generating one or more clusters comprising samples of the audio content identified as having a relationship. The relationship between two or more samples of the audio content may be determined based on a correlation between the audio content of the at least two UGRs, which may be determined based on the clustering techniques described above with respect to FIGS. 1 and 3. Each of the one or more clusters may include a set of one or more UGRs, and each set of UGRs may be associated with a particular portion of the event, such as a certain time period during the event. It is noted that while UGRs within a set of UGRs may be associated with a particular portion of the event, this does not necessarily mean they represent identical portions of the event. To illustrate, assume a set of UGRs includes three UGRs associated with a particular portion of the event that spans a time period of 10 minutes. The first UGR may include audio content (and possibly other content, such as visual content) associated with the first 5 minutes of that 10 minute time period, the second UGR may include audio content (and possibly other content, such as visual content) covering the 3rd through 7th minutes of the 10 minute time period, and the third UGR may include audio content (and possibly other content, such as visual content) covering the 4th through 10th minutes of the 10 minute time period. Thus, these three UGRs may form a set of UGRs associated with a particular portion of the event (e.g., the portion of the event spanning the 10 minute time period). As explained above, these 3 UGRs may be found to have a relationship due to the overlap in their audio content.
For example, the first and second UGRs overlap for the 3rd through 5th minutes of the 10 minute time period; the first and third UGRs overlap for the 4th and 5th minutes of the 10 minute time period; and the second and third UGRs overlap for the 4th through 7th minutes of the 10 minute time period. It is noted that the particular example described above has been provided for purposes of illustration, rather than by way of limitation. Thus, the particular portion of an event associated with a set of UGRs may be associated with longer or shorter periods of time and a set of UGRs may include more than 3 UGRs or may include a single UGR.
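The overlaps in the three-UGR illustration can be checked with a short sketch. Each UGR is represented as a half-open interval in minutes from the start of the 10 minute period, so that the interval (2, 7) covers the 3rd through 7th minutes; this representation and the names used below are assumptions made for illustration.

```python
from itertools import combinations

# Half-open [start, end) intervals in minutes from the start of the period.
ugrs = {
    "first": (0, 5),    # first 5 minutes
    "second": (2, 7),   # 3rd through 7th minutes
    "third": (3, 10),   # 4th through 10th minutes
}


def overlap(a, b):
    """Return the overlapping interval of a and b, or None if disjoint."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def pairwise_overlaps(recordings):
    """Map each pair of recording names to its overlapping interval."""
    return {
        (m, n): overlap(recordings[m], recordings[n])
        for m, n in combinations(recordings, 2)
    }


for pair, span in pairwise_overlaps(ugrs).items():
    print(pair, span)
```

Under this representation the interval (2, 5) corresponds to the 3rd through 5th minutes, (3, 5) to the 4th and 5th minutes, and (3, 7) to the 4th through 7th minutes, matching the overlaps stated above.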

In an aspect, the method 700 may include applying weights to relationships between two or more UGRs. The weights may be applied during generation of the one or more clusters. For example, as explained above, a relationship between two or more UGRs may be identified based on a correlation between the two or more UGRs, and the weights may indicate a strength of the relationship between the two or more UGRs.
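One simple way to use such weights when forming clusters is sketched below: relationships whose weight falls below a threshold are ignored, and the remaining relationships are merged into connected groups with a union-find structure. This is an illustrative stand-in and not the clustering technique of FIGS. 1 and 3; the function name cluster, the threshold min_weight, and the sample weights are assumptions.

```python
def cluster(num_ugrs, weighted_pairs, min_weight):
    """Group UGR indices connected by relationships of sufficient weight.

    weighted_pairs is an iterable of (i, j, weight) tuples, where weight
    indicates the strength of the relationship between UGR i and UGR j.
    """
    parent = list(range(num_ugrs))

    def find(i):
        # Follow parent links to the group representative, compressing
        # the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j, weight in weighted_pairs:
        if weight >= min_weight:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(num_ugrs):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


pairs = [(0, 1, 0.9), (1, 2, 0.8), (2, 3, 0.1), (3, 4, 0.7)]
print(cluster(5, pairs, min_weight=0.5))
```

In this sketch the weak relationship between UGRs 2 and 3 is discarded, so the five UGRs fall into two clusters; a UGR with no sufficiently strong relationship forms a cluster of its own.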

At step 740, the method 700 includes synchronizing, for each of the one or more clusters, the samples of the audio content to produce synchronization data for each of the one or more clusters. As described above, the synchronization data may order the audio content for each of the clusters with respect to time. In an aspect, the synchronizing, at step 740, may include determining a reference UGR for each of the one or more clusters. As explained above, once the reference UGR is identified/determined/selected for a cluster of UGRs, the remaining UGRs associated with that cluster may be synchronized with respect to the reference UGR to produce the synchronization data for that cluster. In an aspect, the step 740 may generate the synchronization data in accordance with the synchronization techniques described above with reference to FIGS. 1 and 3.
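By way of a non-limiting sketch, synchronization with respect to a reference UGR may be pictured as estimating, for each remaining UGR in a cluster, the lag at which its samples best match the reference. The example below uses a brute-force cross-correlation over raw samples; the function names, the choice of the longest recording as the reference UGR, and the `max_lag` parameter are assumptions made for illustration and are not taken from the synchronization techniques of FIGS. 1 and 3.

```python
def estimate_offset(reference, other, max_lag):
    """Return the lag, in samples, at which `other` best aligns with `reference`.

    A positive lag means `other` begins that many samples after `reference`.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(other):
            j = i + lag
            if 0 <= j < len(reference):
                score += value * reference[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def synchronize(cluster, max_lag):
    """Return synchronization data: each UGR's offset relative to the reference."""
    # Assumption for this sketch: the longest recording is the reference UGR.
    ref_name = max(cluster, key=lambda name: len(cluster[name]))
    reference = cluster[ref_name]
    return {
        name: 0 if name == ref_name else estimate_offset(reference, samples, max_lag)
        for name, samples in cluster.items()
    }


# Hypothetical cluster: two shorter recordings cut from the reference signal.
reference = [0, 0, 1, 3, -2, 4, 0, 1, 0, 0]
cluster = {"ref": reference, "late_a": reference[3:8], "late_b": reference[5:9]}
sync_data = synchronize(cluster, max_lag=8)
print(sync_data)
```

Sorting the UGRs of a cluster by the resulting offsets orders their content with respect to time, which is the role the synchronization data plays in the description above.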

At step 750, the method 700 includes normalizing media content based on the synchronization data derived from the audio content of the UGRs included in each of the one or more clusters to produce normalized media content. As explained above, different UGRs within a particular cluster may have different start and/or end points with respect to time. By normalizing the synchronized media content for each of the one or more clusters, at step 750, audible differences across transitions (e.g., start points and end points) between the UGRs of the clusters may be minimized, such that, after mixing, the perceptibility of a transition corresponding to a start and/or end of a particular UGR may be reduced and the mixed content may be more readily perceived as being generated from a single recording, rather than from many different UGRs mixed together. In an aspect, the normalization may be performed iteratively until a stop criterion is satisfied. For example, the stop criterion may correspond to the stop criterion described above with respect to Eq. (39).
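To illustrate the intent of the normalization, the following sketch scales every UGR in a cluster to a common root-mean-square (RMS) level so that level jumps at UGR start and end points are reduced. This single-pass RMS matching is an assumed simplification: it is not the iterative normalization with the stop criterion of Eq. (39), and the names `rms`, `normalize_to_target`, and `target_rms` are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square level of a list of samples (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def normalize_to_target(signals, target_rms):
    """Scale each signal so that its RMS level equals target_rms."""
    normalized = {}
    for name, samples in signals.items():
        level = rms(samples)
        # A silent recording has no level to match, so it is left silent.
        gain = target_rms / level if level > 0 else 0.0
        normalized[name] = [gain * v for v in samples]
    return normalized


# Hypothetical cluster with one loud, one quiet, and one silent recording.
signals = {
    "loud": [0.5, -0.5, 0.5, -0.5],
    "quiet": [0.1, -0.1, 0.1, -0.1],
    "silent": [0.0, 0.0],
}
normalized = normalize_to_target(signals, target_rms=0.25)
for name, samples in normalized.items():
    print(name, round(rms(samples), 6))
```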

At step 760, the method 700 includes mixing media content associated with the plurality of UGRs based at least in part on the synchronization data. In an aspect, the media content that is mixed may comprise normalized media content, which may be generated at step 750. In an additional aspect, the media content that is mixed may comprise original media content (e.g., non-normalized media content). As described above with respect to FIGS. 1 and 3, the mixing may utilize segment-wise mixing techniques, target power mixing techniques, time-frequency domain mixing techniques, maximum component elimination mixing techniques, and/or segment-wise phase alignment mixing techniques. Additionally, it is noted that the synchronization data, which may be generated or derived based on the above-described techniques for processing audio content of the UGRs, may be utilized to synchronize non-audio content, such as video content of the UGRs. Stated another way, UGR audio content may be processed using the various techniques described above to generate synchronization data, which may then be used to synchronize the UGR audio content; to synchronize UGR video content associated with the audio content; and/or to synchronize the UGR video content and audio content.
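The role of the synchronization data in the mixing may be sketched as follows: each UGR is placed on a common timeline at its offset, and every output sample is the average of the recordings available at that instant. Plain averaging is an assumption chosen for brevity and is not one of the named mixing techniques; the function name `mix_on_timeline` and the example values are hypothetical.

```python
def mix_on_timeline(signals, offsets):
    """Average time-aligned signals on a common timeline.

    signals: dict mapping a UGR name to its list of samples.
    offsets: dict mapping a UGR name to its start position, in samples.
    Positions covered by no recording are filled with 0.0.
    """
    start = min(offsets[name] for name in signals)
    end = max(offsets[name] + len(signals[name]) for name in signals)
    total = [0.0] * (end - start)
    count = [0] * (end - start)
    for name, samples in signals.items():
        base = offsets[name] - start
        for i, value in enumerate(samples):
            total[base + i] += value
            count[base + i] += 1
    return [t / c if c else 0.0 for t, c in zip(total, count)]


# Hypothetical cluster: "b" starts two samples after "a".
signals = {"a": [1.0, 1.0, 1.0, 1.0], "b": [3.0, 3.0]}
offsets = {"a": 0, "b": 2}
mixed = mix_on_timeline(signals, offsets)
print(mixed)
```

Because the average is taken only over the recordings present at each position, the output level changes where a UGR starts or ends unless the inputs have comparable levels, which is the kind of transition the normalization of step 750 is intended to make less perceptible.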

As shown above, the method 700 provides a new technological process for creating media content from UGRs. The new technological process provided by the method 700 enables a computing device, such as the system 600 of FIG. 6, to automatically create new media content (e.g., audio content, video content, audio/video content, and the like) from many individual recordings, even when the different individual recordings: span different moments in time (e.g., different portions of an event); have different durations; are captured from different locations or devices, which may impact the characteristics and/or quality of the individual recordings; and/or include different types of content (e.g., some recordings may include audio content and visual content, such as video content, while other recordings may only include audio content). Further, during generation of the new media content, the method 700 implements various operations to improve the overall quality of the new media content. For example, the operations of the method 700: identify relationships between different UGRs; group the UGRs based on identified relationships; synchronize UGRs within and across each UGR group; and/or mitigate audio imperfections associated with transitions between different UGRs. Thus, the method 700 improves the manner in which a computing device, such as the system 600 of FIG. 6, operates by enabling new media content to be automatically generated from a plurality of UGRs.

In order to address various issues and advance the art, the entirety of this application for AUDIO SAMPLE MIXING APPARATUSES, METHODS, AND SYSTEMS (including the Cover Page, Title, Headings, Field, Background, Summary, Brief Description of the Drawings, Detailed Description, Claims, Abstract, Figures, and otherwise) shows, by way of illustration, various embodiments in which the claimed innovations may be practiced. The advantages and features of the application are of a representative sample of embodiments only, and are not exhaustive and/or exclusive. They are presented only to assist in understanding and teach the claimed principles. It should be understood that they are not representative of all claimed innovations. As such, certain aspects of the disclosure have not been discussed here. That alternate embodiments may not have been presented for a specific portion of the innovations or that further undescribed alternate embodiments may be available for a portion is not to be considered a disclaimer of those alternate embodiments. It will be appreciated that many of those undescribed embodiments incorporate the same principles of the innovations and others are equivalent. Thus, it is to be understood that other embodiments may be utilized and functional, logical, operational, organizational, structural and/or topological modifications may be made without departing from the scope and/or spirit of the disclosure. As such, all examples and/or embodiments are deemed to be non-limiting throughout this disclosure. Also, no inference should be drawn regarding those embodiments discussed herein relative to those not discussed herein other than it is as such for purposes of reducing space and repetition. For instance, it is to be understood that the logical and/or topological structure of any combination of any program components (a component collection), other components and/or any present feature sets as described in the figures and/or throughout are not limited to a fixed operating order and/or arrangement, but rather, any disclosed order is exemplary and all equivalents, regardless of order, are contemplated by the disclosure. Furthermore, it is to be understood that such features are not limited to serial execution, but rather, any number of threads, processes, services, servers, and/or the like that may execute asynchronously, concurrently, in parallel, simultaneously, synchronously, and/or the like are contemplated by the disclosure. As such, some of these features may be mutually contradictory, in that they cannot be simultaneously present in a single embodiment. Similarly, some features are applicable to one aspect of the innovations, and inapplicable to others. In addition, the disclosure includes other innovations not presently claimed. Applicant reserves all rights in those presently unclaimed innovations including the right to claim such innovations, file additional applications, continuations, continuations in part, divisions, and/or the like thereof. As such, it should be understood that advantages, embodiments, examples, functional, features, logical, operations, organizational, structural, topological, and/or other aspects of the disclosure are not to be considered limitations on the disclosure as defined by the claims or limitations on equivalents to the claims. It is to be understood that, depending on the particular needs and/or characteristics of a mixing controller individual and/or enterprise user, database configuration and/or relational model, data type, data transmission and/or network framework, syntax structure, and/or the like, various embodiments of the mixing controller may be implemented that enable a great deal of flexibility and customization.

Although the embodiments of the present disclosure and their advantages have been described in detail, it should be understood that various changes, substitutions and alterations can be made herein without departing from the spirit and scope of the disclosure as defined by the appended claims. Moreover, the scope of the present application is not intended to be limited to the particular embodiments of the process, machine, manufacture, composition of matter, means, methods and steps described in the specification. As one of ordinary skill in the art will readily appreciate from the present disclosure, processes, machines, manufacture, compositions of matter, means, methods, or steps, presently existing or later to be developed that perform substantially the same function or achieve substantially the same result as the corresponding embodiments described herein may be utilized according to the present disclosure. Accordingly, the appended claims are intended to include within their scope such processes, machines, manufacture, compositions of matter, means, methods, or steps.

1. A method for mixing audio content based on a plurality of user generated recordings (UGRs), the method comprising: receiving a plurality of UGRs, each UGR of the plurality of UGRs comprising at least audio content; determining a correlation between samples of audio content associated with at least two UGRs of the plurality of UGRs; generating one or more clusters based on the determined correlations, wherein each of the one or more clusters comprises at least one UGR; generating, for each of the one or more clusters, synchronization data based on the audio content; and mixing, for each of the one or more clusters, media content of the plurality of UGRs based at least in part on the synchronization data to produce mixed content.
 2. The method of claim 1, further comprising normalizing the media content associated with the UGRs to produce normalized media content, wherein the mixing of the media content comprises mixing the normalized media content to produce the mixed content.
 3. The method of claim 1, wherein the mixed media content comprises video content.
 4. The method of claim 1, wherein the plurality of UGRs are associated with audio content from an event, wherein each of the one or more clusters comprises a set of UGRs, and wherein each set of UGRs is associated with a particular portion of the event.
 5. The method of claim 4, further comprising calculating a cross-correlation between different pairs of UGRs, wherein UGRs are associated with particular portions of the event and associated with different clusters based at least in part on the cross-correlations.
 6. The method of claim 5, further comprising applying weights during the generating of the one or more clusters, the weights indicating a strength of a relationship between a pair of UGRs of the plurality of UGRs, wherein the relationship is indicated by a correlation determined with respect to the pair of UGRs.
 7. The method of claim 1, wherein the synchronization data orders the media content for UGRs within each of the clusters with respect to time.
 8. The method of claim 2, wherein the normalization is configured to minimize perceptible differences across transitions between one or more UGRs, the transitions corresponding to a start point and an end point associated with a particular UGR, and wherein different UGRs within a particular cluster have different start and/or end points with respect to time.
 9. The method of claim 1, wherein the mixing comprises at least one of: a time-frequency domain mixing technique, a segment-wise mixing technique, a target power mixing technique, a maximum component elimination mixing technique, a minimum variance mixing (MVM) technique, and a segment-wise phase alignment mixing technique.
 10. The method of claim 1, further comprising generating a signal signature for each of the plurality of UGRs, wherein the correlation between the samples of the audio content is determined based at least in part on the signal signatures generated for each of the plurality of UGRs.
 11. A method for mixing media content based on a plurality of user generated recordings (UGRs), the method comprising: receiving a plurality of UGRs, each UGR of the plurality of UGRs comprising at least audio content; generating a signal signature for each of the plurality of UGRs; determining similarities between at least two UGRs of the plurality of UGRs based at least in part on the signal signatures; and estimating a time offset for each of the plurality of UGRs based at least in part on the signal signatures.
 12. The method of claim 11, further comprising: generating one or more clusters based on the similarities between the at least two UGRs, wherein each of the one or more clusters comprises at least one UGR; generating, for each of the one or more clusters, synchronization data based on the audio content; and mixing, for each of the one or more clusters, media content of the plurality of UGRs based at least in part on the synchronization data to produce mixed content.
 13. The method of claim 12, wherein the synchronization data orders the media content for UGRs within each of the clusters with respect to time.
 14. The method of claim 12, wherein the mixed media content comprises video content.
 15. The method of claim 12, wherein the mixing comprises at least one of: a time-frequency domain mixing technique, a segment-wise mixing technique, a target power mixing technique, a maximum component elimination mixing technique, a minimum variance mixing (MVM) technique, and a segment-wise phase alignment mixing technique.
 16. The method of claim 15, further comprising normalizing the media content associated with the UGRs to produce normalized media content, wherein the mixing of the media content comprises mixing the normalized media content.
 17. A non-transitory computer-readable storage medium storing instructions that, when executed by one or more processors, cause the one or more processors to perform operations for mixing media content based on a plurality of user generated recordings (UGRs), the operations comprising: receiving a plurality of UGRs, each UGR of the plurality of UGRs comprising at least audio content; generating a signal signature for each of the plurality of UGRs; determining similarities between at least two UGRs of the plurality of UGRs based at least in part on the signal signatures; and estimating a time offset for each of the plurality of UGRs based at least in part on the signal signatures.
 18. The non-transitory computer-readable storage medium of claim 17, the operations further comprising: generating one or more clusters based on the similarities between the at least two UGRs, wherein each of the one or more clusters comprises at least one UGR; generating, for each of the one or more clusters, synchronization data based on the audio content; and mixing, for each of the one or more clusters, media content of the plurality of UGRs based at least in part on the synchronization data to produce mixed content.
 19. The non-transitory computer-readable storage medium of claim 18, wherein the synchronization data orders the media content for UGRs within each of the clusters with respect to time.
 20. The non-transitory computer-readable storage medium of claim 18, wherein the mixed media content comprises video content.